Senior Thesis [EAS 499]
University of Pennsylvania, Fall 2019
Alex Tianheng Zhao
alexzhao@seas.upenn.edu
Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania
Department of Statistics, The Wharton School, University of Pennsylvania
personal website, github, ORCID
Chris Callison-Burch, PhD
ccb@upenn.edu
Department of Computer and Information Science (SEAS), University of Pennsylvania
Lyle Ungar, PhD
ungar@cis.upenn.edu
Department of Computer and Information Science; additional appointments in the Departments of Bioengineering (SEAS); Genomics and Computational Biology (Penn Medicine); Operations, Information and Decisions (Wharton); and Psychology (SAS), University of Pennsylvania
Max Mintz, PhD
mintz@cis.upenn.edu
Coordinator, Senior Thesis Program, Department of Computer and Information Science
University of Pennsylvania
Using natural language processing (NLP) techniques on approximately 20,000 reports of transcendent experiences, we examine and better understand the dimensions of non-ordinary states of consciousness. The primary data source is the Vaults of Erowid, which consists of approximately 35K experience reports in total. Generalizable pipelines for machine learning and natural language processing are first discussed, providing a high-level overview of the standard processes and tools available to the modern data scientist. We then perform extensive exploratory analysis to understand the data, after which we identify the hidden topics of trip reports through topic modelling, examine clusters of similar experiences, and visualize the dimensions of subjective experiences through word clouds and other techniques. Frameworks for future directions such as predicting substance(s) consumed, trip report text generation, and feature importance considerations are outlined. Numerous machine learning and NLP techniques are surveyed and discussed theoretically, while noting how they can be applied to future research. Finally, we place this research in historical and academic context, and make a proposal for a more connected and blissful future.
For a link to the codebase (which will be continuously updated and also contains an executable Jupyter notebook), see the public project github repository
Psychedelics (from the Greek psyche: mind, delos: make visible, reveal; together meaning “mind manifesting”) are “powerful psychoactive substances that alter perception and mood and affect numerous cognitive processes” (Nichols, 2016). The most well-known “classical psychedelics”, some of which have been used by indigenous cultures for thousands of years, activate the serotonin system, in particular by agonizing the 5-HT2A serotonin receptor subtype (Nichols, 2016). These classical psychedelics include Lysergic Acid Diethylamide (LSD, a semi-synthetic psychedelic derived from the ergot fungi), Psilocybin (an inactive prodrug of the psychoactive Psilocin, found in certain mushrooms), Dimethyltryptamine (DMT, the active ingredient in the powerful, indigenous Amazonian brew Ayahuasca), and Mescaline (found in the Peyote cactus). Newer psychedelics such as 2C-B and 5-Methoxy-Dimethyltryptamine (5-MeO-DMT, found in multiple plant species and most famously the Sonoran Desert Toad) are also often classified under the term "hallucinogen" or "psychedelic". Other related drugs such as 3,4-Methylenedioxymethamphetamine (MDMA) are often referred to as psychedelic drugs, but are more correctly classified as empathogens or entactogens (Nichols, 2016) (meaning “touching within”, from the Greek en: within, the Latin tactus: touch, and the Greek -gen: produce).
On September 4th, 2019, Johns Hopkins University launched the Johns Hopkins Center for Psychedelics and Consciousness Research, with $17 million in research funding (Johns Hopkins University, 2019). The center is thought to be the largest of its kind in the world, and sets out to study addiction, PTSD, depression, anorexia nervosa, trauma, and other mental “illnesses” with psychedelics and traditional forms of therapy.
After decades of suppression and moral hazard, psychedelics are making a return to legitimate research as a novel and surprisingly effective treatment modality for multiple extremely difficult conditions, including but not limited to end-of-life depression and anxiety (Griffiths et al., 2016; Osório et al., 2015), treatment-resistant depression (Carhart-Harris et al., 2017), post-traumatic stress disorder (PTSD) (Mithoefer et al., 2011; Doblin, 2002), alcohol and nicotine addiction (Nichols, 2016), cluster headaches (Sewell, Halpern, & Pope, 2006), and obsessive compulsive disorder (Moreno, Wiegand, Taitano, & Delgado, 2006), and they show early promise for sexual and racial trauma and Alzheimer's. Psychedelics have also shown dramatic results for “well-people”, including reliably inducing “mystical-type experiences” (R. R. Griffiths, Richards, McCann, & Jesse, 2006), increasing creativity (anecdotally, see Nichols, 2016), and being correlated with greatly enhanced feelings of religious devotion (Pahnke, 1963). Beyond clinical studies, animal studies with octopi have demonstrated MDMA increasing feelings of social connectedness (Edsinger & Dölen, 2018). In vitro studies have shown enhanced neuroplasticity (Victor Acero, preprint). Brain imaging studies, primarily out of Imperial College London and University College London, have shown that psychedelics quiet the default mode network (R. L. Carhart-Harris et al., 2012) and induce hyperconnected and entropic brain states (Robin L. Carhart-Harris et al., 2014), and have proposed a mechanism of action for psychedelics, formalized as the Free Energy Principle and REBUS (R. L. Carhart-Harris & Friston, 2019). Selen Atasoy et al. of Oxford University have also recently demonstrated, using a novel technique called Connectome-Specific Harmonic Waves (CSHW), that the brain under psychedelics has more high-energy resonant states while remaining hyper-coherent and highly orchestrated (Atasoy, Donnelly, & Pearson, 2016).
The Qualia Research Institute (QRI) has also recently proposed the Symmetry Theory of Valence (STV), which posits that symmetry in resonant brain states, as modelled by high-dimensional mathematical objects, is correlated with experiential valence (Johnson, 2016).
With all the exciting work being done in the psychedelics landscape (call it “psychedemia”), be it clinical, neuroscientific, or anthropological, little work has been done on understanding what the psychedelic trip actually consists of, or on understanding qualitatively and quantitatively the dimensions of the psychedelic subjective experience. Current work by Imperial College and Oxford (pending publication) is showing promising results for understanding both the macro and micro phenomenology of psychedelic experiences. Using natural language as an admittedly inadequate proxy, we can begin to understand the hallmark subjective signatures of different substances, and potentially identify novel neuro-targets for clinical interventions to reduce suffering, as well as gain a deeper understanding of mechanisms of action for psychedelic substances, if such a mechanistic understanding is even possible. It is our sincere hope that through this work, we may begin to demystify the subjective dimensions that characterize psychedelic experiences, and lay the foundations for more informed clinical and societal support structures that may alleviate unnecessary mental suffering and enrich the lives of well-people.
This thesis seeks to extend the current work by using techniques from machine learning, and in particular natural language processing, on a corpus of psychedelic trip reports retrieved from multiple sources, chiefly the Vaults of Erowid. We will present the results factually, with minimal layers of interpretation and meticulous documentation for high reproducibility, clarifying our assumptions and processes whenever appropriate. It is our hope that this work may be open-sourced and somehow beneficial to clinical researchers, psychologists, neuroscientists, machine learning engineers, social scientists, and/or the lay community as a whole. If not for some deep, metaphysical takeaway, perhaps for some lighthearted, jovial pastime.
It should also be noted that this thesis focuses heavily on the theoretical foundations of select natural language processing techniques, and demonstrates (not necessarily very deeply) the potential applications of these techniques to a psychedelic corpus. I am a firm believer in understanding things deeply in order to explain them simply; with solid theoretical foundations, applications arise naturally, and various techniques can be mixed and matched as long as the assumptions are clear.
In the spring of 2019, Victor Acero, Rahul Sood, Margaret < last name omitted > and I cofounded the Penn Society for Psychedelic Science (PSPS) and scrambled together the inaugural Intercollegiate Psychedelics Summit, IPS 2019, at the University of Pennsylvania. IPS 2019 drew about 150 students and 8 speakers, including Frederick Barrett and Manoj Doss from JHU's Center for Psychedelics and Consciousness Research (Psychedelics Research Unit at the time) and David Nutt, president of the European Brain Council and researcher at Imperial College London's Psychedelics Research Unit. Feeling slightly burnt out from planning the summit in 4 months, we took the summer to recharge, with Victor Acero and Emily Cribas (now PSPS Operational Director) hosting 2 journal clubs over the summer and the remaining cofounders in different corners of the world, recuperating from a very intense 4 months.
Come fall of 2019, we reconvened with great enthusiasm, and collectively ratified PSPS's constitution and bylaws, established the Board of Directors structure, and welcomed 6 new team members onto the PSPS Council, which supports the PSPS General Member Body as a whole. One of the new members was Yev Barkalov (@yevbar), who is one of the most talented software engineers and self-proclaimed hackers I know, having founded Hack the Fog (San Francisco's first high school hackathon) and attended 30+ hackathons just for fun. Around late September, we began working on establishing PSPSLabs, an innovative lab for psychedelic projects under the umbrella of PSPS. Think GoogleX, but for psychedelics and consciousness related projects. We were setting up PSPSLabs to provide a nest for potentially impactful, consciousness expanding projects to grow and blossom, and our hope was that through our work, we may enrich the experiences of those around us in safe and responsible ways. With the ethos of Silicon Valley and a deep, humbling belief that the responsible marrying of technology and science can potentially bring positivity and value to those around us, we began brainstorming projects we could work on for fun to share with the world. Yev and I began calling regularly for hours at a time, sharing resources, chatting about the recent developments in psychedelia, and coming up with random ideas.
One of the projects that blossomed into our shared consciousness on one such call was EveryQualia, a search engine for psychedelic trip reports. We were super curious what the experiences of others were like on psychedelic drugs. We began looking at the Johns Hopkins and Imperial College London survey studies, which had super interesting findings such as how the subjective rating of ego dissolution correlated with the quieting of the default mode network (R. L. Carhart-Harris et al., 2012), and how a single high dose psilocybin session led to significant decreases in depressive symptoms (as measured by survey results). After looking at a few studies, we began wondering — is there more out there? The surveys were good because they were rigorous and comparable between subjects, but a few numbers and scales could hardly capture the ineffability, noetic quality, and richness of psychedelic experiences! We became extremely fascinated by freestyle trip reports, unburdened by the limiting and valve-reducing nature of surveys. Alas, after poking around for a while, we realized that there was no good search engine for psychedelic trip reports! There were various reddit threads and personal stories scattered here and there but no centralized data source. Then we found Erowid, which contained a massive archive of about 35,000 trip reports (as of Dec 9, 2019). The Vaults of Erowid was a treasure trove of information. Indeed, Erowid is one of the oldest and most trusted sources for information about thousands of drugs, psychedelics and otherwise. We thought we could start by scraping Erowid and saving the trip reports to a database, build a simple webapp frontend, and use something like Algolia (search as a service API) to provide smooth realtime search functionality over the massive Erowid trip report corpus. And just like that, EveryQualia V0 was born, live with about 10K trip reports (we didn't have any money, and Algolia was limited to 10K data objects.
Also: if Earth and Fire Erowid are reading this, we apologize for bombarding your servers with our scrapers, and note we totally deserved to be blocked).
Yev became the driving technological powerhouse behind EveryQualia while I dipped deeper into Machine Learning and Natural Language Processing techniques, having taken CIS 350: Computational Linguistics with Chris Callison-Burch in the spring of 2019 and now taking CIS 520: Machine Learning with Lyle Ungar. With version 0 of our webapp, we had real time search and a groovy interface to read trip reports, which was awesome, but we still didn't have a way to visualize any large scale patterns present in the corpus. Since machine learning (ML) is all about learning generalizable patterns from massive amounts of data, applying ML, and in particular NLP, techniques to the psychedelic report corpus seemed like a natural extension. Not only would writing this thesis deepen my understanding of the theoretical foundations of machine learning, it also felt like a valuable launching pad for NLP technology that we plan to integrate into EveryQualia. I will say, at this point in time, just how grateful I am to have two mentors and past professors sign on as advisors to this rather unconventional thesis. Thank you.
A research team of the Chemical Youth Project at the University of Amsterdam published a fantastic visualization in 2016. This project analyzed 20,534 reports, focusing on a select subset of substances, and analyzed the demographics, language, substance co-use, and dosage trends in the Erowid Corpus. As of Dec 9, 2019, their website is not live for unknown reasons, but a July 23, 2019 archived version can be found here. The visualizations done by this team are absolutely stellar, and the reader is invited to check out their work before reading this thesis to get a better feel for the Erowid Dataset.
There is no single mold of an audience for which this thesis is written. A reader who is familiar with the basics of linear algebra and statistics, in particular the basic concepts and notations of vectors, matrices, and probability distributions, will find the thesis straightforward to follow, especially at times when there are slight conceptual leaps. A few sections contain more advanced probability theory, but those can be skipped without sacrificing the overall narrative structure of the thesis. No prior knowledge is assumed about techniques in machine learning and natural language processing (NLP), as the foundations will be presented from first principles, relying first on intuition and supported by mathematics for those interested. The final sections of the thesis consist of musings, extensions, and bonus content for the curious reader, and contain a few non-technical elaborations intertwined with technical notes.
This thesis was as much an empirical exercise as it was a theoretical one, written in a rather didactic manner as a self-exercise to test my own understanding of the material and my ability to convey not only what was done but also the theory of why it was appropriate.
Perhaps more important than prior knowledge for the reader is an innate curiosity, a child-like delight in exploring new frontiers; one who is drawn to principles of foundational interconnectedness and not afraid to embrace concepts that seem at first challenging for the uninitiated, yet delightful for the persistent, may find this thesis enjoyable.
A note: due to scope and time constraints, a few state-of-the-art techniques such as finetuning BERT and GPT-2 language models were researched and planned, but not fully explored. I have left those sections in the table of contents, a deference to what was conceived, and a gentle reminder to my future self that even a thesis is a work in progress and doesn't have to be absolutely "perfect".
The field of machine learning is very broad; indeed, a complete overview of machine learning would fill a tome and is certainly out of scope for this thesis. That being said, there are a few generalizable insights and principles that are very important to mention. They are highly inspired by Professor Pedro Domingos's excellent paper, A Few Useful Things to Know about Machine Learning, and my work-in-progress computer science and statistics education.
This thesis employs the CRISP-DM (Cross Industry Standard Process for Data Mining) pipeline, an open standard widely used by professional data scientists as a standard data mining approach.
CRISP-DM Model Infographic Source
The algorithms chosen in the Modelling stage can significantly impact conclusions. Many papers use black box algorithms and present results as if they are facts, but different initial conditions and design decisions can very quickly change "facts" into approximations. Indeed, machine learning is all about being approximately sure of things, an intuition formalized by probability distributions and statistical estimators.
The first step to any machine learning or natural language processing project is to acquire the data. In this section, we describe the data source and data scraping process.
N.B.: This codeblock exists to switch between Jupyter and Google Colab development environments, and is applicable only to readers who wish to run the notebook. Change the root_path variable to the appropriate path in order to reproduce the code. Additional work may be required to run the code without glitches. Readers are invited to submit pull requests to contribute to the codebase.
# toggle between local and Google Colab development environments
is_colab = False  # set to True if running in Google Colab
if is_colab:
    from google.colab import drive
    drive.mount('/content/gdrive')
# change dir to your project folder if developing locally
root_path = '/content/gdrive/My Drive/UPenn/UPenn_fall-2019/[EAS 499] Senior Thesis f2019' if is_colab else "."
print("root_path now set to {}".format(root_path))
With v0 of EveryQualia, we downloaded the May 2012 Erowid Archive from http://de1.erowid.org/erowid_archives/, which is the most recent archive as of Dec 9, 2019. This download is a full mirror of the Erowid website from May 1st, 2012, and contains 19924 experience reports, each as its own html document. @yevbar maintains a mirror of all the trip report raw files on his github, at Old Erowid Scraper.
One of the earliest and biggest unknowns for this thesis was whether there was going to be enough data on which to perform any meaningful analysis, since machine learning algorithms typically require a sizable amount of data. Early conversations with Lyle Ungar and Chris Callison-Burch suggested that a corpus of 200-300K reports was desirable. Since psychedelics are still, as of Dec 9, 2019, Schedule I controlled substances, reports of psychedelic experiences are not very easy to find, as every published account after the early 1970's is tantamount to admitting a federal crime. As such, a considerable amount of effort that is not necessarily documented here went into looking for sources of psychedelic trip reports. Feelers to the Johns Hopkins team revealed that the session reports in their clinical trials are private and not available for use in this thesis, and conversations with friends in the industry regarding an upcoming Johns Hopkins global survey study that would be the largest of its kind in the world revealed that it was still in IRB stages. Conversations with researchers at the University of Oxford, who are also doing fascinating work with natural language processing on accounts of transcendent experiences, revealed that their studies are based on in-person interviews and their transcriptions, which the researchers personally and painstakingly obtained. Further details about this study cannot be shared in this thesis, as their work serves as the basis of an upcoming PhD thesis, the details of which are not yet public as of this writing. In parallel, the Penn Society for Psychedelic Science, Harvard, Princeton (and 14 other schools, as of Dec 9, 2019) are growing the Intercollegiate Psychedelics Network (IPN).
Multiple conversations were had with other student leaders about setting up an Intercollegiate Psychedelics Archive, or IPN Archives, that would contain not only our respective student club files but also serve as a central source for aggregating trip reports from students. Indeed, there were some initial efforts to make this happen, but due to legal considerations, we decided to hold off until the legal climate was more amenable, for our personal safety and the wellbeing of our respective student organizations.
As such, efforts were directed to other online sources, such as various reddit forums and informal message boards, none of which contained data in the quantity and structure of Erowid. David Yaden, a PhD candidate with Martin Seligman and Lyle Ungar, is doing fantastic work in assembling The Varieties Corpus, which is not yet available to the public or to researchers. A natural extension of this thesis is to incorporate David's data in future work.
As my conversations with Yev progressed, he decided to try once again scraping Erowid live. Note that Erowid has pretty strict guidelines for crawlers.
Running a crawler at full speed (10 pages / second) non stop, crawling all 35,000 trip reports currently live (as of Dec 9, 2019) would take about an hour. This doesn't seem like a lot, but in reality the crawling limits are much lower, including a 1000 pages / day limit, and in practice we got repeatedly blocked by Erowid and it often took several hours to crawl 1000 pages. We entertained the idea of spinning up multiple AWS EC2 instances and using a distributed architecture to crawl Erowid (Yev actually wrote the code, and the idea was that each crawler would have a different IP address to "fool" Erowid). But this would be dishonest, and having not received an email reply from Erowid, plus a variety of other reasons, I decided not to proceed with scraping the entirety of Erowid. Indeed, this is an extension for further research.
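As a quick sanity check on the crawl-time arithmetic above, the estimates can be reproduced in a few lines (the page count and rate limits are those quoted in the text):

```python
# Back-of-the-envelope crawl-time estimates for the Erowid report pages.
TOTAL_PAGES = 35_000   # reports live as of Dec 9, 2019
FULL_SPEED = 10        # pages per second, crawler at full speed
DAILY_LIMIT = 1_000    # pages per day under Erowid's crawl limits

full_speed_hours = TOTAL_PAGES / FULL_SPEED / 3600
days_at_limit = TOTAL_PAGES / DAILY_LIMIT

print(f"At full speed: {full_speed_hours:.2f} hours")   # just under one hour
print(f"At 1000 pages/day: {days_at_limit:.0f} days")   # over a month
```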
Some important links about using the Erowid Experience Vaults:
In the end, we decided to stick with the corpus of EveryQualia v0, which was the May 1, 2012 archive of the full Erowid website from http://de1.erowid.org/erowid_archives/. To extract the data into a usable format, we unarchived the disk image, mounted it to disk, and navigated to the folder containing 19924 html files (see here). The following script was used to extract the report body, report title, and report substance(s) into a csv file:
import csv
from lxml import html
from os import listdir
from os.path import isfile, join

def trip_reports_to_csv():
    trip_csv_path = "trips.csv"
    columns = ["report", "title", "substance"]
    mypath = "experiences"
    onlyfiles = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f))]
    with open(trip_csv_path, "w") as trip_csv:
        trip_writer = csv.writer(trip_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        trip_writer.writerow(columns)
        for f in onlyfiles:
            with open(f, "r", encoding='utf-8', errors='ignore') as f_:
                try:
                    page = f_.read()
                    tree = html.fromstring(page)
                    report = ''.join(tree.xpath('//div[@class="report-text-surround"]/text()')).strip()
                    title = tree.xpath('//div[@class="title"]/text()')[0]
                    substance = tree.xpath('//div[@class="substance"]/text()')[0]
                    trip_writer.writerow([report, title, substance])
                except Exception:
                    # skip files that are not well-formed experience report pages
                    continue

trip_reports_to_csv()
The package lxml.html was used to match xpaths, with credits to Yev Barkalov for the xpaths that correctly matched the proper div elements for the desired sections. In all, the script took about 1 hour to run on our hardware (rough estimate). It should be noted that certain demographic information such as gender, age, and publication date was available (see Section 2.4, Previous Work) but was not included in the analysis of this thesis. This certainly holds promise for future research.
In total, there were 19924 substance reports, each having a report body, a title, and a substance field.
import pandas as pd
trips_raw_filepath = '{}/trips.csv'.format(root_path)
pd_tripReports = pd.read_csv(trips_raw_filepath)
pd.set_option('display.width', 80)
# pd_tripReports.shape
# pd_tripReports.info()
# First 5 entries of the data
pd_tripReports.head()
# Last 5 entries of the data
pd_tripReports.tail()
# An example trip report
report_example = pd_tripReports.loc[9000]
report_example["title"]
print("Title: {}".format(report_example["title"]))
print("Substance: {}".format(report_example["substance"]))
print("Report Body: {}".format(report_example["report"]))
Now that we have a collection of documents, i.e.: transcendent experience reports obtained from the Erowid Experience Vaults, we can continue with the remainder of our analysis. Natural Language Processing (NLP) is a subfield of machine learning, and deals primarily with unstructured textual data. The above graphic outlines the high level standard workflow for an NLP project, which includes:
Text Pre-processing: since the web is messy and unpredictable, we need to clean our data appropriately and perform some pre-processing to prepare our data for subsequent analysis.
Text Parsing & Exploratory Data Analysis: what structures exist in the text, such as parts of speech or named entities (e.g.: TOM can be an organization (ORG) or a person (PER) depending on context)? Text parsing helps us better understand the underlying structures of the text; exploratory data analysis (EDA) is the process of playing with and understanding the data. It's as if the data is a person, and you're asking it questions over coffee to get to know it better. No particular way of conversation is necessarily better than another, and data scientists can rely on tried and true techniques or get creative.
Text Representation & Feature Engineering: the chosen representation of the data, as well as the features extracted from the raw data, are extremely important. Just as a landscape artist needs to choose the angle and form of the mountain he is painting, selecting the scenes and objects (features) to include to represent spring time, a data scientist needs to choose the method(s) of representing the text, and select features to highlight. More often than not, a "feature" in natural language processing is a vector of integers or real numbers (e.g.: counts of words per document).
Modeling and/or Pattern Mining: depending on the task at hand, and the data (i.e.: metadata, labels, etc.) available to us, we can use either supervised or unsupervised techniques to analyze our data. These techniques include but are not limited to topic modelling, clustering (to find similar groups), and dimensionality reduction (to visualize the data). A substantial portion of the thesis examines the above techniques, with particular emphasis on topic modelling.
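To make the "feature as a vector of word counts" idea concrete, here is a minimal bag-of-words sketch using only the standard library; the two toy documents are invented for illustration:

```python
from collections import Counter

# Two invented toy documents standing in for trip reports.
docs = [
    "the trip was beautiful and the visuals were beautiful",
    "the comedown was rough",
]

# Build a shared vocabulary, then represent each document as a count vector.
vocab = sorted({word for doc in docs for word in doc.split()})
counts = [Counter(doc.split()) for doc in docs]
vectors = [[c[word] for word in vocab] for c in counts]

print(vocab)
print(vectors)
```

In practice a library such as scikit-learn's CountVectorizer does this (plus tokenization and vocabulary management) at scale, but the underlying representation is exactly these count vectors.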
Evaluation & Deployment: Once we have a model, we need to have a way to understand how well it performs. That is the topic of Evaluation, a crucial part of the natural language processing and machine learning pipeline. Without sound evaluation metrics, we have created a model, not a good model. Indeed, unsupervised learning techniques are typically hard to evaluate, and note that improving evaluation metrics is a further area of research.
Note that the workflow above has been drawn linearly, but the NLP pipeline in reality has many loops — extracting features, finding patterns, and asking questions about the data do not occur in a vacuum. We present our work below linearly for structure, but the process of getting to know the data and writing this thesis was far from linear.
Text Preprocessing refers to a broad class of techniques that pre-process the text before any analysis is done. It's analogous to a rough sieve filtering out the largest rocks, cleaning and normalizing the corpus at a high level. Note that there is no "canonical process" for text pre-processing. In practice, the types of preprocessing done depend on the type of data, the quantity of data, and trial and error. Steps marked with * seem to be those that are most frequently used.
* Remove extra spaces
* Make all text lowercase
* Remove stopwords
* Tokenization
* Normalize accented characters
* Remove special characters
Remove rogue html / xml tags
Remove numbers
Expanding contractions
Stemming
Lemmatization
For those interested, here is some more detail on each step:
* Remove extra spaces: a whitespace is not just a whitespace in NLP; it can be a space or a \t tab, and depending on the parse, one needs to deal with newline characters such as \n or \r. Stripping away unnecessary whitespace to find word boundaries is a crucial step in pre-processing. This is most commonly done with str.strip()
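A minimal sketch of whitespace normalization, using str.strip() together with a regex to collapse internal runs of spaces, tabs, and newlines (the example string is invented):

```python
import re

messy = "  A  breakthrough\texperience \r\n at  dusk \n"

# Collapse runs of spaces, tabs, and newlines into single spaces,
# then strip leading and trailing whitespace.
clean = re.sub(r"\s+", " ", messy).strip()
print(clean)  # "A breakthrough experience at dusk"
```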
* Make all text lowercase: should wow be treated the same as WOW? In sentiment analysis tasks, capitalization matters very much, as the presence of caps often connotes delight. In other tasks that are more informational, capitalization adds noise and unwanted dimensionality. This is most commonly done with str.lower()
* Remove stopwords: words such as a, the, and, and is occur so frequently in the English language that they often don't add any meaningful information to the task at hand, so we pre-empt noise by adding them to a stopwords list and removing them from our corpus. Care must be taken with stopwords: by adding a word to a stopwords list we are a priori claiming that the word does not add value. Indeed, men tend to use the more than women, so a classifier that predicts gender would be well served to include the presence or counts of the as a feature (Lyle Ungar).
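A sketch of stopword removal; the stopword list here is a tiny hand-picked set for illustration, whereas in practice one would start from a curated list (e.g. nltk's) and adapt it to the task:

```python
# Tiny invented stopword list; real lists contain a few hundred words.
stopwords = {"a", "an", "and", "the", "is", "was"}

tokens = "the light was a wave and the wave was everything".split()
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['light', 'wave', 'wave', 'everything']
```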
* Tokenization: the process of dividing a big chunk of text into individual units is called tokenization. In English, this is fairly straightforward: use individual words as tokens, or individual characters if we're building a character-level model. We also have to consider whether to make punctuation marks tokens, and whether to include any ngrams, e.g.: a bigram such as cat jumped or a trigram such as you are beautiful, separately as tokens, in addition to the individual words cat, jumped, beautiful, etc. Tokenization is considerably harder for languages that don't have spaces, such as Chinese. In practice, we use built-in tokenizers such as those in nltk or spaCy
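As a dependency-free sketch (real pipelines would use nltk or spaCy tokenizers), a crude regex tokenizer plus bigram extraction might look like this; the example sentence is invented:

```python
import re

text = "You are beautiful, and the cat jumped."

# Crude word-level tokenization: lowercase, then match runs of letters
# (keeping apostrophes so contractions like don't survive intact).
tokens = re.findall(r"[a-z']+", text.lower())

# Bigrams are just adjacent token pairs.
bigrams = list(zip(tokens, tokens[1:]))

print(tokens)
print(bigrams[:3])
```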
* Normalize accented characters: e.g.: á to a. Useful for English, but for other languages accented characters may have special meaning and usage. This is most commonly done with unicodedata.normalize and built-in sklearn functionality
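A sketch of accent stripping with the standard library's unicodedata.normalize, as mentioned above (the example phrase is invented):

```python
import unicodedata

def strip_accents(text):
    # NFKD decomposes accented characters into a base character plus
    # combining marks; we then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("séance at the café"))  # "seance at the cafe"
```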
Remove rogue html / xml tags: websites are written in html (xml), css, and js; if a website is a human being, html is the skeleton and meat, css is the makeup, and js provides the instructions for interactivity. As we know, sometimes bones break, but the human still functions; likewise, an html or xml file might be slightly corrupted, but the browser will still render the webpage. After we've scraped the webpage, we remove the html tags that are extraneous, so all we are left with is the meat. This is most commonly done with packages such as Beautiful Soup
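Beautiful Soup is the usual tool; as a dependency-free sketch, Python's built-in html.parser can do a crude version of tag stripping (the markup snippet is invented):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content of a document, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(markup):
    parser = TagStripper()
    parser.feed(markup)
    return "".join(parser.chunks)

print(strip_tags("<div class='title'>A <b>Glimpse</b> of Eternity</div>"))
```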
* Remove numbers: if we were designing a system to answer questions like `what is the population of norway`, numbers in the original data set are crucial. Other times, numbers add only noise. Depending on the problem at hand, the data scientist makes an informed decision about whether to remove numbers. This is most commonly done with regex, or regular expressions, highly structured patterns that "match" patterns of text. For example, the regex that matches one or more digits is `[0-9]+`
* Expand contractions: e.g. `don't` to `do not`. Once again, this is optional and depends on the task and availability of data. Lyle Ungar remarks that this has been highly optional in his work.
* Remove special characters: do special characters such as `<`, `!`, `#` add value? Indeed, sentiment analysis algorithms often look for `!`, and if we were analyzing Twitter data, `#` would be important. In other cases, special characters are treated as gibberish and removed.
* Stemming: stemming refers to the often crude procedure of removing the ends of words to find the "word stem"; e.g.: walks -> walk. The Porter Stemmer, Lovins Stemmer, and Paice Stemmer are three commonly used stemming algorithms. [Figure: example Porter Stemmer rules, from the Stanford IR book (Manning, Raghavan, & Schütze, 2009)]
* Lemmatization: lemmatization, to quote verbatim Stanford's Information Retrieval textbook (Manning, Raghavan, & Schütze, 2009), "usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma". In other words, lemmatization returns the "canonical" form of a word. Some examples of lemmatization (source):
- rocks : rock
- corpora : corpus
- better : good

In a very real sense, lemmatization is a more complex procedure and can often produce more desirable results.
However, it is interesting to note that with sufficiently large datasets in industrial machine learning applications (such as those at Google), stemming and lemmatization are rarely used (source: in conversation with Lyle Ungar).
Finally, save the cleaned data frame, e.g. `news_df.to_csv('news.csv', index=False, encoding='utf-8')`, or use a pickle dump, for ease of retrieval and use later. Note: the text pre-processing procedures draw inspiration from A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text, as well as Professor Chris Callison-Burch's Computational Linguistics class.
# Heavily adapted from: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72
### Getting the right packages
# for Colab only; OR, after installing SpaCy need to download the language models
# !pip install spacy
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_md
# !python -m spacy download en_core_web_lg
import pandas as pd
import spacy
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
import unicodedata
# From Spacy: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.
# nlp = spacy.load('en_core_web_md')  # uncomment to enable lemmatize_text below
tokenizer = ToktokTokenizer()
print("Section 4.2.2 Text Pre-processing: done loading packages")
##### Recall the list of preprocessing tasks
# - Remove extra spaces
# - Make all text lowercase
# - Remove stopwords
# - Tokenization
# - Normalize accented characters
# - Remove rogue html/xml tags
# - Remove numbers
# - Expanding contractions
# - Stemming
# - Lemmatization
# Option to reimport the data
pd_tripReports = pd.read_csv(trips_raw_filepath)
# make text lowercase
pd_tripReports["report"] = pd_tripReports["report"].str.lower()
pd_tripReports["title"] = pd_tripReports["title"].str.lower()
pd_tripReports["substance"] = pd_tripReports["substance"].str.lower()
pd_tripReports.head()
# Prepare stopwords
# Note here we only use the stopwords from the nltk package;
# see section 4.2.3 for a more in-depth treatment of stopword removal
import nltk
nltk.download('stopwords')
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')
# Remove stopwords
def remove_stopwords(text, stopword_list, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
# remove_stopwords("The, and, if are stopwords, computer is not", stopword_list) # example
# * Normalize accented characters
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
# remove_accented_chars('Sómě Áccěntěd těxt') # example
# Remove rogue `html`/`xml` tags
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text
# strip_html_tags('<html><h2>Some important text</h2></html>') # example
# Remove Special Characters
def remove_special_characters(text, remove_digits=False):
    # use a-zA-Z (capital Z): the sloppy range A-z would also match [, \, ], ^, _ and `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
# remove_special_characters("Well this was fun! What do you think? 123#@!",
# remove_digits=True) # example
# Simple Stemmer based on Porter Stemmer
# Lyle Ungar recommends not using a stemmer
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text
# simple_stemmer("My system keeps crashing! his crashed yesterday, ours crashes daily") # example
# Lemmatize Text
# Lyle Ungar recommends not using a lemmatizer
def lemmatize_text(text):
    # requires a loaded spaCy model, e.g. nlp = spacy.load('en_core_web_md')
    doc = nlp(text)
    # spaCy 2.x lemmatizes pronouns to the placeholder '-PRON-'; keep the original token text instead
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])
    return text
# lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily") # example
# Adapted from https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72
def normalize_corpus(corpus, html_stripping=True,
                     accented_char_removal=True, text_lower_case=True,
                     special_char_removal=False, text_lemmatization=False,
                     text_stemming=False, stopword_removal=True,
                     remove_digits=True, contraction_expansion=False):
    """
    Input: corpus, list of documents
    Output: normalized_corpus, list of normalized documents that have been preprocessed
    """
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions (expand_contractions is not defined here; this option defaults to False)
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines; note [\r\n]+ is a character class, a '|' inside it would match literal pipes
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # stem text: caution when stemming + lemmatizing text at the same time
        if text_stemming:
            doc = simple_stemmer(doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces between special characters to isolate them
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, stopword_list, is_lower_case=text_lower_case)
        normalized_corpus.append(doc)
    return normalized_corpus
## Preprocess our corpus
## Long run time: ~ 2 min
# pd_tripReports.info()
# pd_tripReports.dtypes
pd_tripReports["report"] = pd_tripReports["report"].astype('str')
pd_tripReports["title"] = pd_tripReports["title"].astype('str')
pd_tripReports["substance"] = pd_tripReports["substance"].astype('str')
corpus_list = pd_tripReports["report"].to_list()
corpus_list_normalized = normalize_corpus(corpus_list)
pd_tripReports_normalized = pd_tripReports.copy()  # copy so the raw frame is not mutated
pd_tripReports_normalized["report"] = pd.Series(corpus_list_normalized)
# example trip report not preprocessed, first 400 chars
pd_tripReports.loc[19922, "report"][:400]
# example trip report preprocessed, first 400 chars
pd_tripReports_normalized = pd.read_csv("trips_normed-html-accent-special-digits.csv")
pd_tripReports_normalized.loc[19922, "report"][:400]
# OPTIONAL
# save preprocessed file to disk
# pd_tripReports.to_csv('{}/trips_normed-html-accent-special-digits'.format(root_path), index=False, encoding='utf-8')
Because the pre-processing, especially the removal of stopwords, takes a substantial amount of time to run, we store a serialized version of the object so we may retrieve it later at a substantially faster rate; see the pickle tutorial
## OPTIONAL
# Use this cell as a template to save and load large data files;
# replace filenames appropriately;
# Pandas data frames can be saved directly to csv files with `pd_df.to_csv("filename")`
# import pickle
# ### Stores the tokenized reports:
# with open('', 'wb') as out_file:
#     pickle.dump(reports_text_tokenized, out_file)
# ### Loads the tokenized reports:
# with open('reports_tokenized_map', 'rb') as in_file:
#     reports_text_tokenized = pickle.load(in_file)
The removal of stopwords (typically commonly occurring, low-information words) in natural language processing is in itself an art. Both automatic detection of stopwords (Wilbur & Sirotkin, 1992) and dictionary approaches have been widely used. For example, nltk has 179 stopwords, the wordcloud package has 192, and the sklearn package has 318, with 116 words shared among all three lists. Some examples of these shared stopwords are 'how', 'itself', 'an', 'ours', 'who', 'had', 'once', 'before', 'yours', and 'the', which are among the most common words in the English language.
In the previous text pre-processing step, we simply used the 179 stopwords from nltk. Here, we prepare an extended list of stopwords, to be used selectively downstream in topic modelling (Section 6) and word cloud generation (Section 7). The extended stopword list is prepared by (1) using the nltk and wordcloud stopwords as a baseline (we exclude the sklearn stopwords because of known issues) and (2) adding all alternative spellings of drugs observed during EDA. Note that the latter is an optional, conscious design decision: we can leave the drug names in (in which case they show up disproportionately in the word clouds), or we can remove all drug names from the reports and focus on non-drug words that correlate with particular reports. In total, our psychedelic drug corpus has 2783 unique spellings / misspellings of drug names, and our extended stopword list consists of 2999 unique stopwords.
# Retrieve all alternative spellings for substances
# Using R, we found how many unique substance spellings there were, and stored them in the `substances_unique` text file
with open('substances_unique') as f:
    substances_all_alternative_spellings = f.readlines()
substances_all_alternative_spellings = [spelling.strip().replace("'", "") for spelling in substances_all_alternative_spellings]
substances_all_alternative_spellings[:30]
## prepare stopwords
from nltk.corpus import stopwords as nltk_stopwords
from wordcloud import STOPWORDS as wordcloud_STOPWORDS
stopword_list = []
stopwords_nltk = nltk_stopwords.words('english')
stopwords_wordcloud = list(wordcloud_STOPWORDS)
stopword_list.extend(stopwords_nltk + stopwords_wordcloud + substances_all_alternative_spellings)
# deduplicate list
stopword_list = list(dict.fromkeys(stopword_list))
# How many unique stopwords in total do we have?
len(stopword_list)
# Number of stopwords for each dictionary
from sklearn.feature_extraction import stop_words
len(stop_words.ENGLISH_STOP_WORDS) #sklearn
len(nltk_stopwords.words('english')) #nltk
len(wordcloud_STOPWORDS) #wordcloud
len(substances_all_alternative_spellings) #all substance spellings
# overlap between sklearn, nltk, and wordcloud stopwords
overlap_stopwords = [word for word in stop_words.ENGLISH_STOP_WORDS
                     if word in nltk_stopwords.words('english') and word in wordcloud_STOPWORDS]
print(overlap_stopwords[:20])
# # NOTE: Long Run Time
# # Tokenize the text
# reports_text_tokenized = {}
# # reports_text is {substance: text} mapping
# # NOTE: This code takes an extremely long time to run
# for substance in reports_text:
#     substance_text = reports_text[substance]
#     tokens = nltk.word_tokenize(substance_text)
#     tokens = [token.strip().lower() for token in tokens]
#     substance_tokenized_text = ' '.join([token for token in tokens if token not in stopword_list])
#     reports_text_tokenized[substance] = substance_tokenized_text
Text pre-processing serves as a rough filter that removes the lowest-information aspects of our corpus and provides preliminary normalization so we can begin analyzing our data. Certain operations are very common and fairly non-destructive, such as lowercasing, tokenization, removing special characters, and removing extra spaces. That said, one size does not fit all. This is especially true for engineering the stopword list, where we deliberately throw away data before we even look at it. This can be problematic: certain NLP tasks, such as predicting gender from text, rely precisely on signals from common words such as "the" (e.g.: males empirically use "the" more often than females; source: Lyle Ungar), which is commonly discarded in the text pre-processing stage. Tokenization can also be non-trivial, with options for bigrams, trigrams, character-level tokenization, and beyond. Stemming and lemmatization can also drastically affect downstream analysis, and with sufficiently large datasets they are often not performed, in order to preserve information (source: Lyle Ungar). Ultimately, the steps involved in text pre-processing are up to the judgement of the data scientist. In this thesis, we have chosen to be fairly conservative in our text pre-processing, preferring to preserve information over being too selective. This may add noise to subsequent analysis, and a natural extension is to examine more closely the effects of pre-processing on downstream results.
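To make the tokenization options mentioned above concrete, here is a minimal sketch (simple whitespace tokens plus bigrams via `nltk.util.ngrams`; the sentence is invented):

```python
from nltk.util import ngrams

sentence = "the cat jumped over the lazy dog"

# word-level tokens via simple whitespace splitting
tokens = sentence.split()

# bigrams: adjacent token pairs, often added alongside unigrams as extra features
bigrams = [' '.join(pair) for pair in ngrams(tokens, 2)]

# character-level tokens, as a character-level model would use
chars = list(sentence)

print(tokens)
print(bigrams)  # ['the cat', 'cat jumped', ...]
```

A real pipeline would use a proper tokenizer (e.g. from `nltk` or `spaCy`) rather than whitespace splitting, which mishandles punctuation.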
After preliminary data preprocessing (cleaning), we perform data wrangling and Exploratory Data Analysis (EDA). Data wrangling describes a suite of procedures to clean the data and transform it into the right format for subsequent analysis, and is often the most time-consuming stage of a data science pipeline. EDA refers broadly to procedures designed to better understand the data set through visualizations and summary statistics.
Here the EDA was mostly performed in R, in particular using dplyr, directly on the raw data. For reproducibility, the executable R code is reproduced below; very similar procedures can be used with pandas.
# Sets up the ability to use R in a jupyter notebook
# https://stackoverflow.com/questions/39008069/r-and-python-in-one-jupyter-notebook
# To use R, add %%R to the beginning of a cell, before any code and any comments
%load_ext rpy2.ipython
%%R
# Installs the proper R packages
# Depending on local environment (e.g.: whether R packages for Anaconda are installed, this may or
# may not be necessary)
# if(!require('pacman')) {
# install.packages('pacman', repos = "http://cran.us.r-project.org")
# }
# pacman::p_load(dplyr, leaps, car, tidyverse, GGally, reshape2, data.table, ggcorrplot, bestglm, glmnet, mapproj, pROC, tm, SnowballC, wordcloud, RColorBrewer)
%%R
## Read in the raw trip reports
tripReports <- fread("trips.csv")
tripReports.df <- as.data.frame(tripReports)
# glimpse(tripReports)
# Number of trip reports in our dataset: 19924
# nrow(tripReports)
%%R
# Examine the substances present in the dataset, writing to output file
tripReports %>% select(substance) %>% write.table(file = "substances-all", append=FALSE, col.names = FALSE, row.names = FALSE, sep = "\n")
# # Examine the substances present in the dataset, display
tripReports %>% select(substance) %>% head(n=10)
%%R
## Manipulate the dataframe to have 1 hot encoding for substances
tripReports$substance <- tolower(tripReports$substance)
tripReports <- as.data.frame(tripReports)
# Explicitly define the substances of interest
substances.of.interest <- c("substance.mushrooms", "substance.lsd", "substance.mescaline", "substance.cannabis", "substance.mdma", "substance.ayahuasca", "substance.nitrous_oxide", "substance.salvia", "substance.methamphetamine", "substance.dmt", "substance.5_meo_dmt", "substance.alcohol", "substance.ketamine", "substance.ibogaine", "substance.pcp", "substance.kava", "substance.kratom", "substance.morning_glory", "substance.syrian_rue", "substance.unknown", "substance.UNK")
for (sub in substances.of.interest) {
  tripReports[sub] = 0
}
### Adding one hot encodings to the substances
tripReports$substance.mushrooms[grepl("mushroom|mushhrooms|mushooms|psilocin|psilocybin|psilocybe", tripReports$substance)] <- 1
tripReports$substance.lsd[grepl("lysergic acid|lsd", tripReports$substance)] <- 1
tripReports$substance.mescaline[grepl("mescaline|peyote", tripReports$substance)] <- 1
tripReports$substance.cannabis[grepl("cannabis|canabis|cannabbis|cannabinoid|cannabus|cannibis|cannibus| thc ", tripReports$substance)] <- 1
tripReports$substance.mdma[grepl("mdma|ecstacy|ecstasy|ectasy|molly", tripReports$substance)] <- 1
tripReports$substance.ayahuasca[grepl("ayahuasca|ayahausca|ayahusca|p. viridis|p.viridis|b.caapi|b. caapi|cappi|viridis", tripReports$substance)] <- 1
tripReports$substance.nitrous_oxide[grepl("nitric|nitrites|nitrogen|nitrous|whippets", tripReports$substance)] <- 1
tripReports$substance.salvia[grepl("salia|saliva|salivia|sally|salva|salvia|salvinorin", tripReports$substance)] <- 1
# note: the pattern "met" also matches substrings of other names (e.g. "dimethyltryptamine"); a stricter pattern may be warranted
tripReports$substance.methamphetamine[grepl("met|meth|methampetamine|methamphetamine|speed", tripReports$substance)] <- 1
# dmt
tripReports$substance.dmt[grepl("nn-dmt", tripReports$substance)] <- 1
tripReports$substance.dmt[tripReports$substance == "dmt"] <- 1 # cannot just grep dmt otherwise this will include 5-meo-dmt
#5-meo-dmt
tripReports$substance.5_meo_dmt[grepl("5 meo-dmt|5-meo dmt|5-meo-dmt|5meo-dmt", tripReports$substance)] <- 1
tripReports$substance.alcohol[grepl("alchohol|alcohol", tripReports$substance)] <- 1
tripReports$substance.ketamine[grepl("ketamin|ketamine", tripReports$substance)] <- 1
tripReports$substance.ibogaine[grepl("iboga|ibogaine", tripReports$substance)] <- 1
tripReports$substance.pcp[grepl("pcp", tripReports$substance)] <- 1
tripReports$substance.kava[grepl("kava", tripReports$substance)] <- 1
tripReports$substance.kratom[grepl("kratom", tripReports$substance)] <- 1
tripReports$substance.morning_glory[grepl("glory|glories", tripReports$substance)] <- 1
tripReports$substance.syrian_rue[grepl("syrian rue|rue", tripReports$substance)] <- 1
tripReports$substance.unknown[grepl("unknown|unidentified|unkown", tripReports$substance)] <- 1
glimpse(tripReports)
%%R
## Number of reports containing each substance: NOTE: a report might have more than one substance
tripSubstances <- tripReports %>% select(-report, -title, -substance)
tripReportsCount <- data.frame(substance = names(tripSubstances), count = colSums(tripSubstances))
tripReportsCountSorted <- tripReportsCount %>% arrange(desc(count))
tripReportsCountSorted
%%R
## Find reports with only one unique substance
# Note: Long run time ~ 40 seconds
tripReports$substance.unique_label <- "NA"
# glimpse(tripReports)
uniqueSubstanceRows <- tripReports %>% select(-report, -title, -substance, -substance.unique_label) %>% rowSums() == 1
for (row in 1:nrow(tripReports)) {
  # this row contains a unique substance
  if (uniqueSubstanceRows[row]) {
    for (substance in substances.of.interest) {
      if (tripReports[row, substance] == 1) {
        tripReports[row, ]$substance.unique_label <- substance
      }
    }
  }
}
# glimpse(tripReports)
%%R
# TEMP, SAVE: Use this code to save the encoded table for future use in multiple formats
# save(tripReports, file="tripReportsEncoded.Rda")
# write.table(tripReports, file="tripReportsEncodedTable", sep = ";;", row.names = TRUE, col.names = TRUE )
# write.csv(tripReports, file="tripReportsEncoded.csv", sep = ",", row.names = TRUE, col.names = TRUE)
%%R
#### Create a data frame that combines the count for reports containing a substance and reports that are uniquely a single substance
tripReportsUniqueSubstanceCountSorted <- tripReports %>% group_by(substance.unique_label) %>% summarise(count = n()) %>% arrange(desc(count))
colnames(tripReportsUniqueSubstanceCountSorted) = c("substance", "count")
tripReportsUniqueSubstanceCountSorted %>% glimpse()
tripReportsCount_merged <- merge(tripReportsCountSorted, tripReportsUniqueSubstanceCountSorted, by="substance")
colnames(tripReportsCount_merged) = c("substance", "n_includes_substance", "n_single_substance")
%%R
# EXPORT: Table of selected substances and their counts
tripReportsCount_merged %>% arrange(desc(n_includes_substance))
%%R
## This data frame that encodes how many trip reports contains a substance and how many reports are uniquely such a substance'
tripReportsCount_merged.long <- reshape2::melt(tripReportsCount_merged)
# EXPORT: Graph of distribution of substances
tripReportsCount_merged.long %>% ggplot( aes(x = reorder(substance, -value), y = value, fill=variable)) + geom_bar(stat="identity", position = "dodge") + theme(axis.text.x = element_text(angle = 67.5, hjust = 1)) + ggtitle("Count of Substances Present in Trip Reports") + xlab("substance") + ylab("count") + labs(fill = "Number of Reports")
substance column

The substance column is of type string, and describes what substance(s) were associated with a particular trip report. It turns out that users often co-consume multiple substances at once, and may convey that fact in different plaintext formats. For example, suppose a user has had lsd, mushrooms, and lithium in a single session. They may record this as "lsd, mushrooms, and lithium", "lsd, lithium & mushrooms", "mushrooms, lsd and lithium", or in another way, freely using ",", "and", and "&" as delimiters. As such, great care had to be taken with grep, regular expressions, and multiple string operations to label each report with a collection of substances in a standardized format.
We also observed a tremendous number of unique substances in the substance column. Here we describe a few notable characteristics, and key procedures we performed:
- 2783 unique spellings appear in the substance column, accounting for mistakes, alternative spellings, and parenthetical clarifications (e.g.: mda? (sold as ecstasy)). A few substances such as 2-cb had no alternative spellings, while salvia had 130 unique spellings! (a few examples taken verbatim: salvia, sally divinorum (5x extract), salvia d, saliva, salvinorin-a)
- 1379 unique substance labels. It is very possible that I have missed a considerable chunk of duplicated entries, and an interesting task in itself is to further analyze the substance names.

With these considerations, we used one-hot encoding (columns of the form substance.<normalized-name>) to denote the presence of each substance, and also added a substance.unique_label column to denote reports that feature only one substance.
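A minimal sketch of the kind of delimiter normalization involved (an illustration with an invented helper `split_substances`, not the exact grep/regex rules used in the thesis):

```python
import re

def split_substances(raw):
    """Split a free-text substance field on ',', 'and', '&' into a sorted list of labels."""
    parts = re.split(r"\s*(?:,|&|\band\b)\s*", raw.lower())
    # drop the empty strings produced by adjacent delimiters like ", and"
    return sorted(p for p in parts if p)

# all three phrasings normalize to the same label set
print(split_substances("lsd, mushrooms, and lithium"))
print(split_substances("lsd, lithium & mushrooms"))
print(split_substances("mushrooms, lsd and lithium"))
# each prints ['lithium', 'lsd', 'mushrooms']
```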
# Results of the encoding
import pandas as pd
root_path = "."
pd_trips_encoded_from_R = pd.read_csv('{}/tripReportsEncoded.csv'.format(root_path))
pd_trips_encoded_from_R.head()
Since the EDA was performed on the raw data, here we normalize our corpus with our text pre-processing pipeline, and store the resulting data in pd_trips_encoded_normalized for later use
# CAUTION: Long Run Time
# EXPORT
# pre-process the encoded data frame to be the "master data frame" for subsequent analysis
pd_trips_encoded_from_R.dtypes
pd_trips_encoded_from_R["report"] = pd_trips_encoded_from_R["report"].astype('str')
pd_trips_encoded_from_R["title"] = pd_trips_encoded_from_R["title"].astype('str')
pd_trips_encoded_from_R["substance"] = pd_trips_encoded_from_R["substance"].astype('str')
corpus_list_encoded = pd_trips_encoded_from_R["report"].to_list()
corpus_list_encoded_normalized = normalize_corpus(corpus_list_encoded)
pd_trips_encoded_normalized = pd_trips_encoded_from_R.copy()  # copy so the raw frame is not mutated
pd_trips_encoded_normalized["report"] = pd.Series(corpus_list_encoded_normalized)
# change column names to better fit pandas workflow
# regex=False so '.' is a literal dot, not the regex wildcard
pd_trips_encoded_normalized.columns = pd_trips_encoded_normalized.columns.str.replace('.', '_', regex=False)
# encoded and pre-processed data frame
# pd_trips_encoded_normalized.head()
# README: Use the following import to load the cleaned data back into memory
# EXPORT:
import pandas as pd
pd_trips_encoded_normalized = pd.read_csv("pd_trips_encoded_normalized.csv")
import matplotlib.pyplot as plt
# average length of each report (character level)
pd_trips_encoded_normalized = pd_trips_encoded_normalized.assign(report_char_length = pd_trips_encoded_normalized["report"].str.len())
# average length of each report (word level)
pd_trips_encoded_normalized = pd_trips_encoded_normalized.assign(report_length = pd_trips_encoded_normalized["report"].str.split().str.len())
# plotting
fig_chars = plt.figure()
ax_report_char_length = pd_trips_encoded_normalized["report_char_length"].plot.hist(bins = 100)
ax_report_char_length.legend()
ax_report_char_length.set_xlabel("Number of Characters")
ax_report_char_length.set_title("Distribution of Number of Characters in Reports")
# open a new figure so the word-level histogram is not drawn over the character-level one
fig_words = plt.figure()
ax_report_length = pd_trips_encoded_normalized["report_length"].plot.hist(bins = 100)
ax_report_length.legend()
ax_report_length.set_xlabel("Number of Words")
ax_report_length.set_title("Distribution of Number of Words in Reports")
# mean and median of report lengths
# print("Mean report length is {}".format(pd_trips_encoded_normalized["report_length"].mean()))
# print("Mean report length (in chars) is {}".format(pd_trips_encoded_normalized["report_char_length"].mean()))
# print("Median report length is {}".format(pd_trips_encoded_normalized["report_length"].median()))
# print("Median report length (in chars) is {}".format(pd_trips_encoded_normalized["report_char_length"].median()))
# print("Min report length is {}".format(pd_trips_encoded_normalized["report_length"].min()))
# print("Min report length (in chars) is {}".format(pd_trips_encoded_normalized["report_char_length"].min()))
# print("Max report length is {}".format(pd_trips_encoded_normalized["report_length"].max()))
# print("Max report length (in chars) is {}".format(pd_trips_encoded_normalized["report_char_length"].max()))
# obtaining the report with minimum length
# print('Shortest report was "{}"'.format(pd_trips_encoded_normalized.iloc[pd_trips_encoded_normalized["report_char_length"].idxmin()]["report"]))
%%R
# Perform subsequent analysis in R
R_trips_encoded_normalized = fread("pd_trips_encoded_normalized.csv")
R_trips_encoded_normalized <- as.data.frame(R_trips_encoded_normalized)
%%R
# Create Number of substance labels for each report
R_trips_encoded_normalized_eda <- R_trips_encoded_normalized %>%
mutate(num_substance_labels = rowSums(select(., starts_with("substance_"), -substance_unique_label)))
%%R
## MOST FREQUENT SUBSTANCES CO-CONSUMED WITH SELECTED SUBSTANCES
## Substances most frequently co-consumed with LSD
R_trips_encoded_normalized_eda %>% filter(substance_lsd == 1) %>%
select(-substance_lsd) %>%
gather(substance_key, isPresent, substance_mushrooms:substance_UNK) %>%
filter(isPresent == 1) %>%
group_by(substance_key) %>%
summarise(count = n()) %>%
mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
arrange(desc(count))
# Substances most frequently co-consumed with Mushrooms
R_trips_encoded_normalized_eda %>% filter(substance_mushrooms == 1) %>%
select(-substance_mushrooms) %>%
gather(substance_key, isPresent, substance_lsd:substance_UNK) %>%
filter(isPresent == 1) %>%
group_by(substance_key) %>%
summarise(count = n()) %>%
mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
arrange(desc(count))
# Substances most frequently co-consumed with MDMA
R_trips_encoded_normalized_eda %>% filter(substance_mdma == 1) %>%
select(-substance_mdma) %>%
gather(substance_key, isPresent, substance_mushrooms:substance_UNK) %>%
filter(isPresent == 1) %>%
group_by(substance_key) %>%
summarise(count = n()) %>%
mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
arrange(desc(count))
# Substances most frequently co-consumed with mescaline
R_trips_encoded_normalized_eda %>% filter(substance_mescaline == 1) %>%
select(-substance_mescaline) %>%
gather(substance_key, isPresent, substance_mushrooms:substance_UNK) %>%
filter(isPresent == 1) %>%
group_by(substance_key) %>%
summarise(count = n()) %>%
mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
arrange(desc(count))
# Substances most frequently co-consumed with dmt
# No co-consumption
R_trips_encoded_normalized_eda %>% filter(substance_dmt == 1) %>%
select(-substance_dmt) %>%
gather(substance_key, isPresent, substance_mushrooms:substance_UNK) %>%
filter(isPresent == 1) %>%
group_by(substance_key) %>%
summarise(count = n()) %>%
mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
arrange(desc(count))
# Substances most frequently co-consumed with 5-MeO-dmt
R_trips_encoded_normalized_eda %>% filter(substance_5_meo_dmt== 1) %>%
select(-substance_5_meo_dmt) %>%
gather(substance_key, isPresent, substance_mushrooms:substance_UNK) %>%
filter(isPresent == 1) %>%
group_by(substance_key) %>%
summarise(count = n()) %>%
mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
arrange(desc(count))
## ARCHIVED:
# R_trips_encoded_normalized %>% filter(substance_5_meo_dmt == 1) %>%
# select(starts_with("substance_"), -substance_unique_label) %>%
# colSums()
# code output below represents the output for 5-MeO-dmt
%%R
# SUMMARY STATISTICS FOR NUMBER OF SUBSTANCES CONSUMED WITH EACH SUBSTANCE
# convert data to long data format
R_trips_encoded_normalized_eda_long <- R_trips_encoded_normalized_eda %>% tidyr::gather(substance_key, isPresent, substance_mushrooms:substance_UNK)
## validate the long data format
# R_trips_encoded_normalized_eda_long %>% select(-report) %>% group_by(title) %>% summarize(n()) %>% head()
# EXPORT:
# Distribution of number of substances for each of the selected substances
R_trips_encoded_normalized_eda_long %>%
filter(isPresent == 1) %>%
group_by(substance_key) %>% summarize(min_num_lab = min(num_substance_labels),
med_num_lab = median(num_substance_labels),
mean_num_lab = mean(num_substance_labels),
max_num_lab = max(num_substance_labels))
%%R
# DISTRIBUTION OF THE NUMBER OF SUBSTANCE LABELS
# Those without a substance label indicate the substances that were not analyzed
# As we can see, a significant portion of the corpus contains substances that were not analyzed
R_trips_encoded_normalized_eda_long %>%
filter(isPresent == 1) %>%
ggplot(aes(x = num_substance_labels, fill = substance_key)) +
geom_bar() +
ggtitle("Number of Substance Labels")
%%R
# PERCENTAGE OF REPORTS BY SUBSTANCE WITH CERTAIN NUMBER OF SUBSTANCES
# https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr
R_trips_encoded_normalized_eda_long %>%
filter(isPresent == 1) %>%
group_by(substance_key, num_substance_labels) %>%
summarise(count = n()) %>%
mutate(percent = round(count / sum(count) * 100, digits=1)) %>%
filter(num_substance_labels <=3) %>%
print(n = 100)
# summarize(one_substance = sum(value[num_substance_labels == 1])
Text parsing to obtain information about language syntax and structure is often performed as part of the NLP pipeline. We mention this here for completeness, but omit the analysis in this thesis. Interested readers are encouraged to refer to the excellent article A Practitioner's Guide to Natural Language Processing. Topics include:
Named Entities refer to real-world objects, physical or abstract, such as an organization, person, quantity, or time. An example of a Named Entity is Apple (ORG), the company. Named Entity Recognition (NER) is the task of extracting and labelling these real-world objects in free text, and it is non-trivial: depending on the context, for example, Apple can be a proper noun or simply a fruit.
We can use pretrained models from NLP packages such as spacy to perform NER and visualize the results. An example is shown below:
import spacy
from spacy import displacy
import en_core_web_sm
# load the pretrained small english language model
spacy_nlp = en_core_web_sm.load()
target_report = pd_trips_encoded_normalized.iloc[1777]["report"]
spacy_doc = spacy_nlp(target_report)
displacy.render(spacy_doc, style="ent", jupyter = True)
Another task that is common in the NLP pipeline is to perform sentiment analysis with pre-trained models. The most commonly used models (source) are
We include this for completeness, but skip the in-depth analysis for this thesis.
As mentioned, there were a substantial number of substances in the corpus, and it was unfeasible to analyze them all. In the end we decided to focus on 20 labels, chosen both for their pharmacological properties (psychedelics, or substances most closely resembling psychedelics) and for their frequencies. All the substances in the following list were among the most frequently occurring substances in the corpus.
# create a constant list to represent chosen substances
SUBSTANCES_OF_INTEREST_LIST = ['substance_5_meo_dmt',
'substance_alcohol',
'substance_ayahuasca',
'substance_cannabis',
'substance_dmt',
'substance_ibogaine',
'substance_kava',
'substance_ketamine',
'substance_kratom',
'substance_lsd',
'substance_mdma',
'substance_mescaline',
'substance_methamphetamine',
'substance_morning_glory',
'substance_mushrooms',
'substance_nitrous_oxide',
'substance_pcp',
'substance_salvia',
'substance_syrian_rue']
# the most classical psychedelics
SUBSTANCES_PSYCHEDELICS_LIST = ["substance_mushrooms",
"substance_lsd",
"substance_mescaline",
"substance_5_meo_dmt",
"substance_dmt"]
The following data frame represents the data that has been cleaned, filtered, and encoded, and will be used in all subsequent analysis.
# Saving the cleaned, filtered, and encoded data to file
# pd_trips_encoded_normalized.to_csv("pd_trips_encoded_normalized.csv", index = False)
# README: Use the following import to load the cleaned data back into memory
# EXPORT:
import pandas as pd
pd_trips_encoded_normalized = pd.read_csv("pd_trips_encoded_normalized.csv")
# first 6 rows of the cleaned data
pd_trips_encoded_normalized.head(n=6)
For those unfamiliar with natural language processing, we introduce a simple but very central and powerful idea: each word can be represented as a vector of real numbers. This may seem counterintuitive, but the ability to represent words as vectors is absolutely key to our ability to perform analysis on unstructured text. Almost all machine learning algorithms take vectors or matrices as inputs, and vector representations open up a world of vector and matrix operations such as dot products and cosine similarity, which allow us to ask (and answer) questions such as "how **similar** are two words?" intelligently and precisely.
In the above infographic, each word is represented by 4 numbers, or 4 dimensions, representing a word's animal-ness, domesticated-ness, pet-ness, and fluffy-ness. With this, we can ask whether a monkey is more similar to an elephant or to a cheetah using cosine similarity.
# Illustration of cosine similarity
import numpy as np

monkey = np.array([-0.02, -0.67, -0.21, -0.48])
elephant = np.array([-0.04, -0.09, 0.11, -0.06])
cheetah = np.array([0.27, -0.28, -0.2, -0.43])

# cosine similarity relies on the intuition that the smaller the angle
# between two vectors, the more similar they are
def cos_sim(v1, v2):
    return np.dot(v1, v2) / (np.sqrt(sum(v1**2)) * np.sqrt(sum(v2**2)))

cos_sim(monkey, elephant)  # 0.4926634171010775
cos_sim(monkey, cheetah)   # 0.8251926299381945
# It looks like a monkey is more similar to a cheetah than it is to an elephant!
Words are represented as vectors mainly in two ways (see here also):
The first class of methods relies on the distributional hypothesis in linguistics (Zellig Sabbettai Harris, 1954), which formalized the intuition that words that co-occur in the same contexts tend to be more similar. There are two main ways to construct vectors this way:
term-document matrix: Very briefly, suppose we had a vocabulary (of unique tokens) of size V, and a collection of documents (such as Wikipedia pages) of size D. We can make a V x D matrix TDM, with words as the rows and documents as the columns. The entry TDM(i, j) would then be how many times word i occurs in document j. Just like that, we have created a term-document matrix, one of the most important building blocks in natural language processing (its transpose, the document-term matrix, is equally often used). Recall that each row represents a word, and each row is a vector of dimension D. In this precise way, we have just represented each of the V words in our vocabulary as a vector of dimension D.

term-term matrix: Now, instead of using a V x D matrix, we can make a V x V matrix TTM, and figure out a sensible way to populate the elements. One intuitive way is to take a "sliding window" with some width over the text, and count how many times some word j occurs in the context window of word i. That count would then go into TTM(i, j). In this way, each word is represented as a V-dimensional vector.

Note that because V and D are often on the order of millions, these vector representations are absolutely huge, and many fields would be 0. For this reason, these are also called sparse vector embeddings.
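The sliding-window counting behind a term-term matrix can be sketched in a few lines of Python (a minimal illustration; the function name and toy sentence are our own, not from the thesis code):

```python
from collections import defaultdict

def term_term_counts(tokenized_docs, window=2):
    """Count how often word j occurs within `window` tokens of word i."""
    counts = defaultdict(int)
    for doc in tokenized_docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, doc[j])] += 1
    return counts

docs = [["the", "cat", "sat", "on", "the", "mat"]]
ttm_counts = term_term_counts(docs, window=1)
ttm_counts[("cat", "sat")]  # 1
```

In a real corpus these counts would be arranged into the V x V matrix described above (typically stored in a sparse format, since most entries are zero).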
The second class of methods relies on the intuition that if we use words as inputs to perform some machine learning task (such as predicting the most likely surrounding context words given a current word using a neural network, say), the weights and biases learned by the neural network will act as good approximations for the vector representations of the words.
The most commonly used models are Word2Vec (CBOW and Skip-Gram with Negative Sampling) by Google, the GloVe model from Stanford University, and the FastText model by Facebook. Although the three models rely on different machine learning tasks, they all rely on randomly initialized vectors converging to vector approximations of words as the machine learning task trains (Chris Callison-Burch). The exact details are beyond the scope of this thesis, but there are 3 very important takeaways: (1) the dimensions of the word vectors are arbitrary and user-specified; (2) the numbers in the learned vectors don't have intrinsic meaning the way the entries in the sparse representations above do. The learned vector for a word is an abstract embedding in a higher-dimensional space, derived from a particular training corpus and a particular algorithm. There is no single vector that is a particular word. Finally, (3) since the dimensions are user-specified, they are typically kept to around 100 to 300, and as such these word vectors are often called dense vector embeddings.
For those interested, here are three infographics we made that provide some context for the different algorithms, examples of the embeddings learned, and links to the original papers that introduced the different models. Notice here that words in the Word2Vec example are 100-dimensional vector embeddings, and words in the GloVe example are 300-dimensional vector embeddings. Indeed, multiple pre-trained models with different initial configurations (such as dimension of embedding and size of training corpus) are readily available for use. For a deeper yet approachable introduction to word embeddings arising from deep learning models, Dipanjan Sarkar's article on Towards Data Science is an excellent place to start.

Once the idea that words can be represented as vectors sinks in, the idea that entire documents can be represented as vectors feels very intuitive. Recall that in the previous section we saw a term-document matrix, a V x D matrix with terms as rows and documents as columns. Taking the rows, we represented each term as a D-dimensional vector. On the flip side, if we simply took the columns, we could represent each document as a V-dimensional vector. This results in a sparse embedding of a document.
We can also utilize the fact that documents are collections of words, each of which can be represented by a vector. As such, we can simply take the average of all the word vectors in a document, and call that our document vector. This results in a dense embedding that represents a document. In practice, this seemingly naive method of averaging works surprisingly well (Chris Callison-Burch), and the intuition of averaging word vectors is agnostic to which word embedding was chosen, be it GloVe, Word2Vec, FastText, or a DIY embedding. One commonly used document embedding is Google's Doc2Vec, which, as you might have guessed, builds on top of Word2Vec word embeddings.
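The averaging idea can be sketched in numpy. The 4-dimensional vectors below are hand-made illustrative values, not from any trained model:

```python
import numpy as np

# Hypothetical 4-dimensional word embeddings (illustrative values only)
embeddings = {
    "ego":   np.array([0.1, 0.8, -0.3, 0.2]),
    "death": np.array([0.0, 0.9, -0.1, 0.4]),
    "bliss": np.array([0.5, 0.2,  0.7, 0.1]),
}

def doc_vector(tokens, embeddings):
    """Dense document embedding: the mean of the word vectors in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

v = doc_vector(["ego", "death", "bliss"], embeddings)
v.shape  # (4,)
```

Out-of-vocabulary tokens are simply skipped here; real pipelines must also decide how to handle documents with no in-vocabulary tokens.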
Now that we have introduced the important intuition that words and documents can be represented as vectors, we can bring this theory into practice and vectorize our corpus. Let us begin with the basics and slowly get more complex.
This corresponds to constructing the aforementioned term-document matrix, where we construct a V x D matrix from the corpus, and each document is represented with a V-dimensional vector. sklearn has a built-in vectorizer called CountVectorizer which we can use directly on our normalized corpus. As an implementation detail, this procedure actually produces a document-term matrix, which is the transpose of the aforementioned term-document matrix. Practically, this creates a D x V matrix with documents as rows and terms as columns, with all else the same. Moving forward, we will adopt the convention of keeping documents as rows and terms as columns.
# OPTIONAL: In case Kernel dies, use this to load back dataset
import pandas as pd
pd_trips_encoded_normalized = pd.read_csv('{}/pd_trips_encoded_normalized.csv'.format(root_path))
# with open('pd_trips_encoded_normalized.csv', 'rb') as in_file:
# pd_trips_encoded_normalized = pickle.load(in_file)
# pd_trips_encoded_normalized.head()
# SPIKE: for some very strange reason, reading the csv file back resulted in np.nan trip reports
pd_trips_encoded_normalized["report"].isnull().sum()
# dropna: https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan
pd_trips_encoded_normalized.dropna(subset=['report'], inplace=True)
# check there are no NA's
# pd_trips_encoded_normalized["report"].isnull().sum()
# Code adapted from:
# https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer expects a list of documents; i.e.: [string]
norm_corpus = pd_trips_encoded_normalized["report"].to_list()
len(norm_corpus) # 19924 reports
# instead of using raw counts, we can divide matrix(i, j) by the sum over column j to get a proportion
# min_df = minimum document frequency
# max_df = maximum document frequency
tf_vectorizer = CountVectorizer(min_df=0., max_df=1.)
dtm_tf = tf_vectorizer.fit_transform(norm_corpus)
dtm_tf_nparr = dtm_tf.toarray()
# get all unique words in the corpus
vocab = tf_vectorizer.get_feature_names()
# show document feature vectors
pd.DataFrame(np.round(dtm_tf_nparr, 2), columns=vocab).head()
############################################
## Ngram model extensions
# you can set the n-gram range to 1,2 to get unigrams as well as bigrams
# bv = CountVectorizer(ngram_range=(2,2))
# bv_matrix = bv.fit_transform(norm_corpus)
# bv_matrix = bv_matrix.toarray()
# vocab = bv.get_feature_names()
# pd.DataFrame(bv_matrix, columns=vocab)
# Some basic properties we can learn about our data
len(vocab) # 107108 vocabulary words, or "tokens", in our training corpus
vocab[20000:20010] # example of tokens
Each row represents a document, and each column is a vocabulary word. We notice that the document-term matrix generated is extremely sparse (mostly zeros), and there seems to be a large number of nonsense tokens such as 00, 000, 0000, etc. More processing may be required to remove these tokens for better performance.
Using just the raw word frequencies, we might overemphasize words that are shared across all trip reports, giving us little if any insight into the most significant words associated with each type of substance. If some word is fairly rare in the English language in general (like "avocado"), but appears quite a few times in a single document, it makes intuitive sense to upweight the importance of that rare word with respect to that particular document. In other words, we need a formula to capture the intuition "if a word appears very frequently in a particular document and doesn't appear too frequently in other documents, the word is probably important for that particular document". To capture this intuition and improve upon the raw frequency baseline, we can use the tf-idf weighting for each (word, document) pair.
tf-idf stands for term frequency-inverse document frequency, and is specified for each (word, document) pair. It is by far one of the most widely used weighting schemes for considering how important a word is to a document, and is widely used in text mining and information retrieval. For an approachable academic introduction to tf-idf, see Chris Callison-Burch's lecture slides.
tf stands for Term Frequency (Luhn 1957), and is simply the number of times a word appears in a document. Precisely, the term frequency of word $i$ in document $j$ is given by:
$tf_{ij}$ = # of times word $i$ appears in document $j$
idf stands for Inverse Document Frequency (Spärck Jones 1972), and intuitively measures how many documents contain a particular word; words appearing frequently in many documents have less importance to any particular document, and thus a lower idf score. Precisely, the inverse document frequency of word $i$ is given by
$$ idf_{i} = \log\left(\cfrac{N}{df_i}\right)$$
where $df_i$ is the document frequency of word $i$ (the number of documents containing word $i$), and $N$ is the total number of documents.
With these definitions, we say the tf-idf weighting of word $i$ in document $j$ is given by
$$ tf.idf(i,j) = tf_{ij} \cdot idf_{i}$$
# TFIDF Code
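To make the weighting concrete, here is a hand-computed sketch using the raw $\log(N/df_i)$ form above, on a made-up three-document corpus. (Note that sklearn's TfidfVectorizer uses a smoothed idf and L2 normalization by default, so its numbers will differ.)

```python
import math

# Toy corpus of three tokenized "documents"
corpus = [
    ["time", "slowed", "avocado"],
    ["time", "passed"],
    ["time", "stopped"],
]
N = len(corpus)

def tfidf(word, doc, corpus):
    tf = doc.count(word)                      # term frequency in this document
    df = sum(1 for d in corpus if word in d)  # number of documents containing the word
    idf = math.log(N / df)                    # inverse document frequency
    return tf * idf

tfidf("time", corpus[0], corpus)     # 0.0: "time" appears in every document
tfidf("avocado", corpus[0], corpus)  # log(3) ~ 1.10: rare word, upweighted
```

As expected, a word shared by every document gets zero weight, while a word unique to one document gets the largest weight.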
# adapted from: https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
dtm_tfidf = tfidf_vectorizer.fit_transform(norm_corpus)
dtm_tfidf_nparr = dtm_tfidf.toarray()
vocab = tfidf_vectorizer.get_feature_names()
pd.DataFrame(np.round(dtm_tfidf_nparr, 2), columns=vocab).tail()
# OPTIONAL
## Use this cell as a template to save and load large data files;
## replace filenames appropriately;
# import pickle
# ### Stores the tokenized reports:
# with open('dtm_tfidf', 'wb') as out_file:
# pickle.dump(dtm_tfidf, out_file)
We can use these vectorized representations of documents to ask: how similar are two documents? Or rather, how similar are any two trip reports to each other? With this question in mind, we can construct a similarity matrix based on cosine similarity. Now, the following matrix is 19924 x 19924, and is very hard to visualize and interpret. Nevertheless, we include it here for completeness as one way to visualize the relationship between documents.
Similarity matrices such as these are building-block inputs to clustering algorithms such as KMeans and to certain visualization algorithms (see sklearn), and are an important component in the data scientist's toolkit.
# Adapted from https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix_tfidf = cosine_similarity(dtm_tfidf)
similarity_tfidf_df = pd.DataFrame(similarity_matrix_tfidf)
similarity_tfidf_df.head(10)
# OPTIONAL
# with open('similarity-matrix_dtm_tfidf', 'wb') as out_file:
# pickle.dump(similarity_tfidf_df, out_file)
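To briefly illustrate the clustering use case mentioned above, here is a minimal sketch of running KMeans over tf-idf document vectors. The four-sentence corpus is made up for illustration (not the Erowid data), and we cluster the tf-idf vectors directly rather than the similarity matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus: two "nature" snippets and two "nausea" snippets
toy_corpus = [
    "time dissolved and the trees breathed",
    "the trees breathed with colour",
    "nausea and vomiting for an hour",
    "strong nausea then vomiting",
]
X = TfidfVectorizer().fit_transform(toy_corpus)

# cluster the document vectors into 2 groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km.labels_  # e.g. the two "nature" and the two "nausea" snippets group together
```

On the real corpus, the same pattern applies to dtm_tfidf, with the number of clusters as a tunable parameter.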
Intuitively, one can imagine that a document such as a blog post, wikipedia article, or in our case, a psychedelic trip report, might consist of numerous topics. A news article published in the New York Times might talk about extreme weather, deforestation, and climate change when covering, say, a hurricane; a psychedelic trip report might talk about time, experiences with nature, or feelings of bliss. How do we figure out the topics in an unlabelled collection of documents? We turn to a class of topic modelling algorithms, of which the most widely used is Latent Dirichlet Allocation (LDA). (Side note: the acronym LDA is also often used to mean "Linear Discriminant Analysis", used for classification or dimensionality reduction.)
Latent Dirichlet Allocation (LDA) has a fancy-sounding name, and its inner workings are indeed fairly complex, but what it does is quite intuitive: LDA attempts to (1) automatically discover topics from a collection of documents, and (2) model each document in that collection as exhibiting multiple topics in different proportions (Blei 2012). So what is a topic? Each topic is precisely a list of words. For example, the following lists of words represent 4 topics extracted automatically from the journal Science, where each topic is a list of words (Blei 2012, Fig 2):
topic1: human genome dna genetic genes sequence gene molecular sequencing map information genetics mapping project sequences

topic2: evolution evolutionary species organisms life origin biology groups phylogenetic living diversity group new two common

topic3: disease host bacteria diseases resistance bacterial new strains control infectious malaria parasite parasites united tuberculosis

topic4: computer models information data computers system network systems model parallel methods networks software new simulations

It should be noted that since LDA is an unsupervised technique, we don't require our input text documents to have topic labels. Instead, LDA processes our collection of text documents and extracts topics automatically. Notice also that the discovered topics don't have "names" by themselves; by looking at the list of words associated with each topic above, it is reasonable to say topic1 is approximately "genetics", topic2 is approximately "evolution", topic3 is approximately "disease", and topic4 is approximately "computers". In practice, the labelling of topics can either be done by humans, or automatically by taking the word that is most "relevant" to a topic as its label, by some sensible definition of "relevance". Indeed, the above word lists are sorted from most relevant to least relevant for each topic.
This beautiful graphic presented by Blei 2012 helps us further the intuition of how each document can exhibit multiple topics:

Now, before we dive deeper into the inner workings of LDA, let's take a look at the LDA topic modelling pipeline, lest we lose track of the big picture:

As we can see, we have the collection of $D$ documents (i.e.: trip reports from Erowid), each with some $N_j$ words, obtained via the processes described in Section 3: Data Acquisition. In Section 4: Data Cleaning and Exploratory Data Analysis (EDA), we processed our documents using an array of pre-processing techniques such as tokenization and stopword removal. In Section 5: Text Representation and Feature Engineering, we created vector representations of documents through a variety of methods, including the naive (but fairly effective) term-document counts, along with tf-idf weighting, both of which fall under "Bag of Words (BOW)" methods. And now, in this section, we use LDA to create a topic model to capture the topics represented in our collection of trip reports. It should be noted that LDA is not the only topic model available to modern data scientists. Other methods include Non-negative Matrix Factorization (NMF) and Probabilistic Latent Semantic Analysis (pLSI), which we note here but do not dive deeply into, to maintain thesis scope. For those familiar with singular value decomposition (SVD): both NMF and pLSI relate deeply to the SVD of the term-document matrix. Furthermore, "LDA can also be seen as a type of principal component analysis (PCA) for discrete data" (Blei 2012).
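For a flavor of the NMF alternative mentioned above, here is a minimal sketch on a toy corpus (made-up sentences, not a tuned model): NMF factorizes the D x V document-term matrix into a documents-by-topics matrix W and a topics-by-terms matrix H.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus with two obvious themes: "time" and "body load"
toy_corpus = [
    "time felt infinite and time stood still",
    "infinite time, endless time",
    "waves of nausea in my body",
    "my body shook with nausea",
]
X = TfidfVectorizer().fit_transform(toy_corpus)

# W (documents x topics) and H (topics x terms) factorize the document-term matrix
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_
print(W.shape, H.shape)  # W: (4, 2), H: (2, vocabulary size)
```

As with LDA, each row of H can be read off as a ranked word list for a topic, and each row of W gives a document's loading on each topic.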
Now, we are ready to dive deeper into what LDA does. Let us begin with an infographic for intuition:

LDA takes as input a collection of text documents, which has been preprocessed and vectorized. Note that the format of these text documents isn't set in stone, but most commonly for LDA implementation we have a $D$ x $m$ matrix, where each row represents a document $d$, represented by a $m$-dimensional document vector embedding. We discussed many document vector embeddings in Section 5. For ease, we can use the tf-idf weighted term-document counts.
For concreteness, below is the vectorized representation of the first 6 documents in our corpus:
# First 5 rows of our tf-idf weighted document-term matrix
# each row represents a document
# each column is a term, or "features"
pd.DataFrame(np.round(dtm_tfidf_nparr, 2), columns=vocab).head()
# weirdly enough, the following does not work; it has to do with sparse vs. dense representations
# pd.DataFrame(np.round(dtm_tfidf, 2), columns=vocab).head()
# Using the document-term matrix with tfidf weighting, our input is D x m, where D = 19915, the number of documents
# and m = 107108, or V, the size of our vocabulary
dtm_tfidf_nparr.shape
Referring to the picture above, LDA gives us the following outputs:
The topic-term matrix (ttm), a $K$ x $V$ matrix, which specifies a word distribution for each topic $k$. In other words, since each topic is a collection of words, it makes sense that some words are more "representative" of a topic $k$ than other words; in this way, each topic has a distribution over words.

The document-topic matrix (dtopm), a $D$ x $K$ matrix, which specifies a topic distribution for each document $d$. In other words, since we can understand each document as having multiple topics, it makes sense that some topics are more saliently present in a document as its "main topics / main themes". Since this document-topic matrix is a vectorized representation of the documents, it could also be used as an input to wordcloud, or as a feature for other machine learning algorithms.

For the reader interested in how LDA works, we present a more technical treatment of LDA here, heavily drawn from Blei 2012 and Christine Choig's Lecture Slides.
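In sklearn terms, the topic-term matrix corresponds to the fitted model's components_ attribute, and the document-topic matrix comes from transform. A minimal sketch on a toy corpus (the variable names are ours):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for trip reports
toy_corpus = [
    "time dissolved into light",
    "light and time everywhere",
    "nausea and heavy body load",
    "body load and nausea",
]
X = CountVectorizer().fit_transform(toy_corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
ttm = lda.components_     # K x V: (unnormalized) word weights per topic
dtopm = lda.transform(X)  # D x K: topic proportions per document (rows sum to 1)
print(ttm.shape, dtopm.shape)
```

On the real data, the same two calls against the fitted lda_tf or lda_tfidf models yield the matrices discussed above.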
Firstly, we note that the topic structure of our documents is latent (i.e.: hidden, unobserved). We only observe the documents and words themselves.
As such, LDA attempts to infer the hidden topic structure from the observed documents (Blei, 2012). Formally, LDA tries to compute the posterior distribution, or conditional distribution of the hidden structure given the observed documents. In Blei's words: "what is the hidden structure that likely generated the observed collection?"

To describe LDA formally, we introduce the following notation and (adapted) explanations from Blei 2012 to maintain consistency with standard notation:
The topics $\beta_{1:K}$ (a K x 1 vector of distributions), where each topic $\beta_k$ is a (discrete) distribution over the vocabulary. Recall that a topic $k$ is a list of words, some of which are more "relevant" than others for the given topic $k$; in this way, each topic $\beta_k$ is a discrete distribution over the vocabulary.

The topic proportion $\theta_{d,k}$ (a real number $\in [0, 1]$) represents the proportion of topic $k$ in the $d$-th document. Recall that if we have $K$ topics, each document is modelled as a mix of these $K$ topics, each topic with some proportion between 0 and 1. This allows us to make statements such as "this document is 80 percent about time, 18 percent about genetics, and 2 percent about other miscellaneous topics".

The topic assignments $z_d$ (an $N$ x 1 vector), where $z_{d,n}$ (an integer $\in [1, K]$) is the topic assignment for the $n$-th word in document $d$. Recall that each document is thought of as a mixture of $K$ topics, each with some proportion between 0 and 1. One way to get this proportion is by conceptualizing each word $n$ in a document $d$ as having a topic $k$; the proportion of words "with" topic $k$ is then the proportion of topic $k$ in document $d$.

The full joint probability distribution modelled by LDA is given by:
$$ p( \beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K}p(\beta_i)\prod_{d=1}^D p(\theta_d) \left( \prod_{n=1}^N p(z_{d, n} | \theta_d)\,p(w_{d, n} | \beta_{1:K}, z_{d, n}) \right)$$

Recall that LDA infers the hidden topic structure by modelling the conditional distribution of the hidden structure given the observed documents. Formally, LDA is modelling:
$$p( \beta_{1:K}, \theta_{1:D}, z_{1:D} | w_{1:D}) = \cfrac{p( \beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}$$

The numerator, the joint probability distribution over all the random variables, is straightforward to compute, but the denominator, obtained by marginalizing over all latent topic structures, is exponentially large and intractable to compute (Blei 2012). As such, the LDA posterior is approximated in two main ways: through sampling-based algorithms and through variational algorithms (Blei 2012). A sampling-based algorithm constructs an approximation of the posterior by repeatedly drawing samples that in the limit converge to the true posterior distribution, and is non-deterministic (Blei, 2012). The most popular sampling-based algorithm is Gibbs Sampling. Variational methods are deterministic algorithms. They presuppose that the posterior distribution comes from a parametrized family of distributions (e.g.: a bell curve that is centered or off-center, narrower or flatter, depending on the parameters of mean and variance) (Blei, 2012). This makes variational inference an optimization problem: find the parameters that define the distribution closest to the posterior. Here, closeness is measured by the Kullback-Leibler divergence (KL-divergence), which takes as input two distributions (order matters) and returns a real-valued measure of closeness (Blei, 2012).
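A minimal numpy sketch of the KL-divergence between two discrete distributions (toy numbers, our own), illustrating both that identical distributions have zero divergence and that the order of arguments matters:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; note the asymmetry (order matters)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
kl_divergence(p, q)  # a small positive number
kl_divergence(p, p)  # 0.0: identical distributions have zero divergence
```

This simple form assumes both distributions put nonzero mass on every outcome; practical implementations handle zeros explicitly.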
Assumptions of LDA (adapted from Blei, 2012):
Topic Model Extensions, and Future Research Areas: See Wallach (2006) and Griffiths, Steyvers & Blei (2015) for models that relax the bag-of-words assumption by conditioning words on previous words, and by using LDA in conjunction with a hidden Markov model (HMM), both of which lead to improved performance. Other types of topic models include (Blei 2012): the correlated topic model and the pachinko allocation model, which account for correlations between topics (e.g.: machine learning and statistics may be more correlated than machine learning and penguins); the spherical topic model, which allows words to be unlikely in a topic (for example, "coconuts" is unlikely in documents about watches); and sparse topic models and "bursty" topic models, which respectively impose more constraints over topic distributions and more realistically model word counts (Blei 2012).
For those familiar with plate models, the following graphic (Blei 2012) reminds us that LDA is an instance of a broader class of algorithms: probabilistic graphical models. $D$ and $N$ represent multiplicity, i.e.: the referenced cells are repeated $D$ and $N$ times. $\alpha$ and $\eta$ are hyperparameters that specify the shape of two Dirichlet distributions, used to describe the $\theta_d$ (topic distribution of document $d$) and $\beta_k$ (word distribution of topic $k$) distributions respectively.

In practice, LDA is implemented either via Gibbs Sampling (e.g.: the mallet package) or via variational methods (e.g.: gensim), and optimized using Expectation Maximization (EM) algorithms. We note without theoretical justification that Lyle Ungar has found empirically that mallet works better and provides more interpretable models (source: conversations). Luckily for us, packages such as mallet and sklearn are easy to use, and provide state-of-the-art LDA implementations. We first use the sklearn implementation of LDA, which uses an online variational Bayes algorithm, due to its ease of use and compatibility with LDA visualization packages such as pyLDAvis.
# USE sklearn: https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb
# imports for visualizing LDA
import pyLDAvis
import pyLDAvis.sklearn
# setting for jupyter notebooks
pyLDAvis.enable_notebook()
# imports for LDA and preparing the data set
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
####
# Prepare LDA inputs: vectorization of documents
# use the tfidf matrix we had before as the document vector representation for input into LDA
# our lda input is simply dtm_tfidf, the document term matrix with tfidf weighting. We take a look at the shape
print(dtm_tfidf.shape)
####
# Fit Latent Dirichlet Allocation models
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=5, random_state=0)
lda_tfidf.fit(dtm_tfidf)
# Fitting LDA with TF document term matrix
lda_tf = LatentDirichletAllocation(n_components=5, max_iter=5,
learning_method='online',
learning_offset=50.,
random_state=0)
lda_tf.fit(dtm_tf)
# print the top words for each topic
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    print(model.components_.shape)
    print(len(feature_names))
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
        print()
# Print the topics derived from tfidf document-term matrix
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(lda_tfidf, tfidf_feature_names, n_top_words)
# Print the topics derived from tf document-term matrix
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda_tf, tf_feature_names, n_top_words)
# tt_matrix is the topic-term matrix, and gives us each topic (row) as a list of words (columns)
# Source: https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41
# tt_matrix = lda_tfidf.components_
# for topic_weights in tt_matrix:
# topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
# topic = sorted(topic, key=lambda x: -x[1])
# topic = [item for item in topic if item[1] > 5]
# print(topic)
# print()
The package pyLDAvis is the Python implementation of the powerful R visualization package LDAvis for LDA models. The reader is encouraged to play with the package, which gives us interactive visualizations. It is commented out here due to its long run times.
###
# Visualization of Latent Dirichlet Allocation models
# using the tf-idf model to visualize TF-IDF
# pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)
# pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
Results from tfidf DTM, with 5 topics
Results from tf DTM, with 5 topics
Firstly, we remark that the number 5 was completely arbitrary: fine-tuning the number of topics is in itself a non-trivial task, and provides a plethora of future research directions. Next, the topics selected are fascinating. First observe the topics extracted using the tfidf document-term matrix. Topic 0 seems to be drug names, many of which are nootropics (cognitive performance / mood enhancing drugs) and antidepressants. Topic 2 relates to time and experience, and seems to be an abstract expression of phenomenology. I cannot quite make sense of topics 1, 3, and 4, which seem to be in Italian and German. This was incredibly surprising for me, as I had not realized that such a significant portion of these foreign reports is featured in the corpus.
The topics extracted using the term-frequency document-term matrix seem even more interesting. Topic 0 seems to deal primarily with time, with the words "time", "day", "hours", and the experience of it ("experience", "feel", "like", "effects", "felt"). Topic 2 also mentions time, and seems to focus primarily on the logistics of trips, featuring simple words like "go", "around", "went", "still", "thought". Topics 1 and 3 are in Italian and German respectively. Topic 4 is by far the most interesting to me. It mentions "experience", "mind", "reality", "state", "energy", "space", "consciousness", all of which refer to the consciousness-expanding, ineffable, and visceral noetic qualities of the most powerful psychedelic experiences. Feeling at one with the universe, becoming present with the fabric of reality, and noticing energy flows of space and time are experiences that have been described both in the literature (Stanislav Grof, 2009) and in conversations with colleagues and friends. Of particular note is the presence of both "dmt" and "salvia" in this topic; indeed, dmt is among the most powerful psychedelic substances in the world, and salvia remains one of the least understood and most mysterious (see Rogan, 2018, Joe Rogan in conversation with Hamilton Morris, for layman introductions). Together, dmt and salvia represent a certain frontier in our understanding of psychedelics.
Word clouds are an intuitively visual way to present vast amounts of textual information, revealing the key themes conveyed in a collection of documents. Prima facie, generating word clouds feels like a simple problem, yet deeper considerations quickly arise, including (1) which preprocessing steps to apply, (2) whether to include documents of class $j \neq i$ when constructing a word cloud for documents of class $i$, and (3) how to scale tokens, given (1) and (2).
The most basic word clouds use term counts, or term frequencies, to scale the size of the words.
Now, we actually have two high-level choices when creating word clouds:
In the first approach, we can simply use the document-term matrix containing tfidf weightings we created earlier, which is of size D x V, where D is the total number of trip reports and V is the size of our vocabulary. In this representation, every trip report is represented as a vector. Hence, to get an idea of what the experience of a particular substance is like, we can randomly sample reports of that substance and create multiple word clouds for it.
In the second approach, we can concatenate all the reports for a particular substance into one document, for a total of 19 documents or samples. Under this schema, each substance is a single document, the concatenation of all the reports corresponding to that substance. It should be noted that there are concerns with doing this, as idf typically assumes a large number of documents in the corpus.
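To make the idf concern concrete, here is a small back-of-the-envelope sketch (illustrative only, not part of the analysis itself) using sklearn's smoothed idf formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1. With only 19 concatenated substance documents, idf has far less dynamic range than with a corpus of roughly 20,000 individual reports:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # sklearn's smooth_idf formula: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# With only 19 concatenated substance "documents", idf spans a narrow range:
idf_rare_small = smoothed_idf(19, 1)     # term in a single substance document
idf_common_small = smoothed_idf(19, 19)  # term in every substance document

# With ~20,000 individual reports, rare terms are weighted far more strongly:
idf_rare_large = smoothed_idf(20000, 1)

print(round(idf_rare_small, 2), round(idf_common_small, 2), round(idf_rare_large, 2))
```

With 19 documents, the rarest possible term is only weighted about 3.3x more than a ubiquitous one, versus more than 10x in the full corpus, so idf does comparatively little discriminative work under the concatenation schema.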
Let's begin with the first approach and then explore the second.
We will use our normalized corpus. Note that since each trip report may contain multiple classes, we will include a document that contains k classes in k word clouds.
# # CHECKPOINT: Read in data if kernel has died
# import pandas as pd
# pd_trips_encoded_normalized = pd.read_csv("pd_trips_encoded_normalized.csv")
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
## Re create the Term Frequencies document term matrix as before
pd_trips_encoded_normalized.dropna(subset=['report'], inplace=True)
pd_trips_encoded_normalized["substance_unique_label"] = pd_trips_encoded_normalized["substance_unique_label"].str.replace('.', '_', regex=False)  # regex=False: treat '.' as a literal character, not a regex wildcard
norm_corpus = pd_trips_encoded_normalized["report"].to_list()
# the same procedure as above
tf_vectorizer = CountVectorizer(min_df=0., max_df=1.)
dtm_tf = tf_vectorizer.fit_transform(norm_corpus)
dtm_tf_nparr = dtm_tf.toarray()
vocab = tf_vectorizer.get_feature_names()
pd_report_vocab_tf = pd.DataFrame(np.round(dtm_tf_nparr, 2), columns=vocab)
pd_report_vocab_tf.head()
# obtain the tf vectors of the reports for which we wish to create word clouds
from random import seed
from random import choice
# for ease of analysis, we set a seed so the random reports are the same every time
# change seed to generate wordclouds of different reports
# also see: https://machinelearningmastery.com/how-to-generate-random-numbers-in-python/
chosen_seed = 14*(10**9)
seed(chosen_seed)
N_REPORTS_EACH_SUBSTANCE = 5 # number of reports to select for each substance
substance_reports_tf_map = {} # map {substance: {report: {vocab: tfidf weighting}}}
for substance in SUBSTANCES_OF_INTEREST_LIST:
    # list of indices corresponding to reports containing only this substance
    index_substance_reports = pd_trips_encoded_normalized.index[pd_trips_encoded_normalized['substance_unique_label'] == substance].tolist()
    # randomly select reports corresponding to each substance
    selected_report_indices = []
    for _ in range(N_REPORTS_EACH_SUBSTANCE):
        selected_report_indices.append(choice(index_substance_reports))
    # Store tf vectors for each report
    substance_reports_tf_map[substance] = {}
    for reportindx in selected_report_indices:
        substance_reports_tf_map[substance][reportindx] = pd_report_vocab_tf.iloc[reportindx].to_dict()
###### PLOTTING WORDCLOUDS BASED ON REPORT LEVEL TF WEIGHTINGS ######
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
substances_to_plot = SUBSTANCES_PSYCHEDELICS_LIST + ["substance_mdma"]
num_substances = len(substances_to_plot)
# Plot the various wordclouds
fig = plt.figure(figsize=(70, 60))
fig.suptitle("Word Clouds based on individual report TF weightings", fontsize=16)
print("Plotting word clouds with seed {} for substances: {}\n".format(chosen_seed, substances_to_plot))
print("plotting {} rows (for number of substances) and {} columns (for num reports per substance)\n\n".format(num_substances, N_REPORTS_EACH_SUBSTANCE))
inx = 0
for index_substance, substance in enumerate(substances_to_plot):
    for index_reportindx, reportindx in enumerate(substance_reports_tf_map[substance]):
        # WordCloud also takes a stopwords=[] argument; omitted.
        wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate_from_frequencies(substance_reports_tf_map[substance][reportindx])
        # Subplot control: https://towardsdatascience.com/subplots-in-matplotlib-a-guide-and-tool-for-planning-your-plots-7d63fa632857
        # print("plotting on ({}, {})".format(index_substance, index_reportindx))
        ax = plt.subplot2grid((num_substances, N_REPORTS_EACH_SUBSTANCE), (index_substance, index_reportindx))  # note: subplot indices begin at 0 here
        ax.set_title("{} Wordcloud".format(substance))
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis("off")
        inx += 1
# plt.savefig('wordclouds_naive_psychedelics_tf.png')
# plt.show()
We take a slight pause to reflect upon our usage of the word cloud package. To create word clouds, we use the wordcloud package. Note that there are many tunable parameters in the package, and for this section we are making numerous assumptions, including:
The wordcloud package has a relative_scaling attribute, which specifies the importance of relative word frequencies for font size. If relative_scaling = 0, only word ranks are considered; if relative_scaling = 1, word sizes are directly proportional to their frequencies. The default of relative_scaling = 0.5 considers both word frequency and word rank, and is used in our analysis. Insight retrieved from the wordcloud documentation and from this tutorial.
# Use the wordcloud package to generate baseline wordclouds
# Also see https://github.com/amueller/word_cloud
# Also see https://www.datacamp.com/community/tutorials/wordcloud-python
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
from wordcloud import STOPWORDS as wordcloud_STOPWORDS
import nltk
from nltk.corpus import stopwords as nltk_stopwords
nltk.download('stopwords')
import matplotlib.pyplot as plt
pd_trips_encoded_normalized = pd.read_csv("pd_trips_encoded_normalized.csv")
### Concatenate corpora for substances of interest
substances_all_list = pd_trips_encoded_normalized.loc[:, 'substance_mushrooms':'substance_syrian_rue'].columns.tolist()
# {substance_<name>: text for all reports containing substance}
substance_reports_map = {}
for substance in substances_all_list:
    # Concatenate with " " by default; TODO: can specify separator as such: str.cat(sep="...")
    reports = pd_trips_encoded_normalized.query('{} == 1'.format(substance))["report"].str.cat(sep=" ")
    substance_reports_map[substance] = reports
# ## Optionally prepare stopwords corresponding to substance names
# # a list of all the substances or
# substances_all_list = pd_trips_encoded_normalized.loc[:, 'substance_mushrooms':'substance_syrian_rue'].columns.tolist()
# # [ 'mushrooms', 'lsd', 'cannabis', ... etc] for use as stopwords
# substances_all_plainstring_list = [substance.replace("substance_", "") for substance in substances_all_list]
# substances_all_plainstring_list = [substance.replace("_", " ") for substance in substances_all_plainstring_list]
# substances_all_plainstring_list
### Create simple wordclouds based on term frequency
# a list of the "classical psychedelics"
substances_psychedelics_list = ["substance_mushrooms", "substance_lsd", "substance_mescaline", "substance_5_meo_dmt", "substance_dmt"]
# NOTE: change this line of code to change which substances to plot
substances_to_plot = substances_psychedelics_list
num_plots = len(substances_to_plot)
print("Now plotting wordclouds for {} substances...".format(num_plots))
# creating subplots: https://stackoverflow.com/questions/25239933/how-to-add-title-to-subplots-in-matplotlib
# changing figure size: https://stackoverflow.com/questions/332289/how-do-you-change-the-size-of-figures-drawn-with-matplotlib
fig = plt.figure(figsize=(50, 10))
fig.suptitle("Word Clouds based on word Frequencies", fontsize=16)
for index, substance in enumerate(substances_to_plot):
    # WordCloud also takes a stopwords=[] argument; omitted.
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(substance_reports_map[substance])
    ax = plt.subplot(num_plots, 1, index+1)  # note: for subplots, index must be in [1, num_plots]
    ax.set_title("{} Wordcloud".format(substance))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
plt.savefig('wordclouds_naive_psychedelics_tf.png')
plt.show()
### Plotting a single wordcloud
# wordcloud = WordCloud(stopwords=stopwords_all, max_font_size=50, max_words=100, background_color="white").generate(reports_text_tokenized["substance_mushrooms"])
# plt.figure()
# plt.imshow(wordcloud, interpolation="bilinear")
# plt.axis("off")
# plt.show()
from sklearn.feature_extraction.text import TfidfVectorizer
## Re create the TFIDF weighted document term matrix as before
pd_trips_encoded_normalized.dropna(subset=['report'], inplace=True)
pd_trips_encoded_normalized["substance_unique_label"] = pd_trips_encoded_normalized["substance_unique_label"].str.replace('.', '_', regex=False)  # regex=False: treat '.' as a literal character, not a regex wildcard
norm_corpus = pd_trips_encoded_normalized["report"].to_list()
# the same procedure as above
tfidf_vectorizer = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
dtm_tfidf = tfidf_vectorizer.fit_transform(norm_corpus)
dtm_tfidf_nparr = dtm_tfidf.toarray()
vocab = tfidf_vectorizer.get_feature_names()
pd_report_vocab_tfidf = pd.DataFrame(np.round(dtm_tfidf_nparr, 2), columns=vocab)
pd_report_vocab_tfidf.head()
# obtain the tfidf vectors of the reports for which we wish to create word clouds
from random import seed
from random import choice
# for ease of analysis, we set a seed so the random reports are the same every time
# change seed to generate wordclouds of different reports
# also see: https://machinelearningmastery.com/how-to-generate-random-numbers-in-python/
chosen_seed = 2**17
N_REPORTS_EACH_SUBSTANCE = 5 # number of reports to select for each substance
substance_reports_tfidf_map = {} # map {substance: {report: {vocab: tfidf weighting}}}
for substance in SUBSTANCES_OF_INTEREST_LIST:
    # list of indices corresponding to reports containing only this substance
    index_substance_reports = pd_trips_encoded_normalized.index[pd_trips_encoded_normalized['substance_unique_label'] == substance].tolist()
    # randomly select reports corresponding to each substance
    selected_report_indices = []
    for _ in range(N_REPORTS_EACH_SUBSTANCE):
        selected_report_indices.append(choice(index_substance_reports))
    # Store tfidf vectors for each report
    substance_reports_tfidf_map[substance] = {}
    for reportindx in selected_report_indices:
        substance_reports_tfidf_map[substance][reportindx] = pd_report_vocab_tfidf.iloc[reportindx].to_dict()
###### PLOTTING WORDCLOUDS BASED ON REPORT LEVEL TFIDF WEIGHTINGS ######
substances_to_plot = SUBSTANCES_PSYCHEDELICS_LIST + ["substance_mdma"]
num_substances = len(substances_to_plot)
# Plot the various wordclouds
fig = plt.figure(figsize=(70, 60))
fig.suptitle("Word Clouds based on individual report TFIDF weightings", fontsize=16)
print("Plotting word clouds with seed {} for substances: {}\n".format(chosen_seed, substances_to_plot))
print("plotting {} rows (for number of substances) and {} columns (for num reports per substance)\n\n".format(num_substances, N_REPORTS_EACH_SUBSTANCE))
inx = 0
for index_substance, substance in enumerate(substances_to_plot):
    for index_reportindx, reportindx in enumerate(substance_reports_tfidf_map[substance]):
        # WordCloud also takes a stopwords=[] argument; omitted.
        wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate_from_frequencies(substance_reports_tfidf_map[substance][reportindx])
        # Subplot control: https://towardsdatascience.com/subplots-in-matplotlib-a-guide-and-tool-for-planning-your-plots-7d63fa632857
        # print("plotting on ({}, {})".format(index_substance, index_reportindx))
        ax = plt.subplot2grid((num_substances, N_REPORTS_EACH_SUBSTANCE), (index_substance, index_reportindx))  # note: subplot indices begin at 0 here
        ax.set_title("{} Wordcloud".format(substance))
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis("off")
        inx += 1
# plt.savefig('wordclouds_naive_psychedelics_tfidf.png')
# plt.show()
## See TFidfVectorizer docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse.csr import csr_matrix # need this if we want to save tfidf_matrix; see https://stackoverflow.com/questions/34449127/sklearn-tfidf-transformer-how-to-get-tf-idf-values-of-given-words-in-documen
# NOTE: reports_text_tokenized stores the tokenized corpus
# List of substances under consideration, sorted to preserve alphabetical ordering
substances_list_sorted = sorted(list(substance_reports_map.keys()))
substances_list_sorted
# TfidfVectorizer expects a list of strings, so we prepare the corpus in the appropriate format
corpus_psychedelics_sorted = []
for substance in substances_list_sorted:
    corpus_psychedelics_sorted.append(substance_reports_map[substance])
Note that by using TfidfVectorizer, we are also making the following assumptions. The following outlines the parameters of TfidfVectorizer for reference; additional information can be found in the documentation:
strip_accents='ascii': strips accented characters, such as á to a
lowercase=True: automatically converts all the text to lowercase
analyzer='word': considers words as the basic unigram unit; we can also build a character-level model with 'char'
stop_words='english': uses the built-in stopwords given by sklearn; follow these instructions to view all the stopwords for sklearn and nltk
ngram_range=(1, 1): setting to (1, 1) gives us unigrams; for bigrams and trigrams, we use (1, 2) and (1, 3) respectively
max_df=1.0: using the float 1.0 sets the maximum document frequency to 100% of the documents
min_df=1: ignores all words with document frequency less than 1
max_features=None: do not cap the number of features
binary=False: use the tf counts, instead of setting all non-zero counts to 1
dtype=float64: default datatype for tf-idf values
norm='l2': normalizes each row of tf-idf values to unit L2 (Euclidean) length
use_idf=True: use idf weighting (instead of just tf)
smooth_idf=True: akin to Laplace smoothing; helps us prevent divisions by zero
sublinear_tf=False: do not replace tf with 1 + log(tf)
# create and use a TfidfVectorizer
# We specify explicitly the parameters used, even for defaults, in order to be precise with our assumptions
# and for reproducibity
# note: sklearn by default throws away punctuations; we accept this as we are not performing sentiment analysis
# and related tasks where punctuations may be highly informative
# https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XeLZRzJKjmE
vectorizer_wc_tfidf = TfidfVectorizer(strip_accents='ascii',
                                      lowercase=True,
                                      analyzer='word',
                                      stop_words='english',
                                      ngram_range=(1, 1),
                                      max_df=1.0,
                                      min_df=1,
                                      max_features=None,
                                      vocabulary=None,
                                      binary=False,
                                      dtype=np.float64,
                                      norm='l2',
                                      use_idf=True,
                                      smooth_idf=True,
                                      sublinear_tf=False)
X_vectors_tfidf = vectorizer_wc_tfidf.fit_transform(corpus_psychedelics_sorted)
# vectorizer_tfidf.get_feature_names()
# Take a look again at the sorted ordering of the substances
# substances_list_sorted
# convert sparse tfidf matrix to np array
substance_term_matrix_tfidf_nparr = X_vectors_tfidf.toarray()
vocab = vectorizer_wc_tfidf.get_feature_names()
# create dataframe mapping substance to vocabulary
pd_substance_vocab_tfidf = pd.DataFrame(np.round(substance_term_matrix_tfidf_nparr, 2), columns=vocab, index = substances_list_sorted)
# pd_substance_vocab_tfidf.head()
# Create vocab: tfidf weighting maps for each substance, so we can use them in wordclouds
# map from {substance_<> : {vocab: tfidf weighting}}
substance_tfidf_map = {}
for substance in substances_list_sorted:
    substance_tfidf_map[substance] = pd_substance_vocab_tfidf.loc[substance].to_dict()
### Create simple wordclouds based on TFIDF weights
# a list of the "classical psychedelics"
substances_psychedelics_list = ["substance_mushrooms", "substance_lsd", "substance_mescaline", "substance_5_meo_dmt", "substance_dmt"]
# NOTE: change this line of code to change which substances to plot
substances_to_plot = substances_psychedelics_list
num_plots = len(substances_to_plot)
print("Now plotting wordclouds for {} substances based on TFIDF...".format(num_plots))
# creating subplots: https://stackoverflow.com/questions/25239933/how-to-add-title-to-subplots-in-matplotlib
# changing figure size: https://stackoverflow.com/questions/332289/how-do-you-change-the-size-of-figures-drawn-with-matplotlib
fig = plt.figure(figsize=(50, 10))
fig.suptitle("Word Clouds based on substance TFIDF weightings", fontsize=16)
for index, substance in enumerate(substances_to_plot):
    # WordCloud also takes a stopwords=[] argument; omitted.
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate_from_frequencies(substance_tfidf_map[substance])
    ax = plt.subplot(num_plots, 1, index+1)  # note: for subplots, index must be in [1, num_plots]
    ax.set_title("{} Wordcloud".format(substance))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
plt.savefig('wordclouds_naive_psychedelics_tfidf.png')
plt.show()
### Plotting a single wordcloud
# wordcloud = WordCloud(stopwords=stopwords_all, max_font_size=50, max_words=100, background_color="white").generate(reports_text_tokenized["substance_mushrooms"])
# plt.figure()
# plt.imshow(wordcloud, interpolation="bilinear")
# plt.axis("off")
# plt.show()
Recall from earlier discussions that as long as we can represent documents as vectors, we can use those vector representations for word clouds, clustering, and other tasks. Word clouds require some measure of the importance of each word to a document; in other words, as long as we have a {word: real number} mapping for each document, we can construct word clouds. With this foundation, we can in principle use the outputs of other algorithms to construct word clouds. For example, we can use the following:
Tag: Future Directions
Previously we discussed a variety of approaches that use document vectorizations as starting points for generating word clouds. While we will not go into the implementation, it is worth noting another possible approach based on work done by Chris Callison-Burch.
This approach uses the Informative Prior Log Odds (IPLO) ratio method, which is used in natural language processing to identify differences in language usage patterns between two groups. The foundation rests upon the log odds ratio, a widely used technique in statistical inference and hypothesis testing. Intuitively, we ask: how much more likely is a word to appear in one group of documents than in another?
For those interested, the theoretical framework is outlined in more detail here, based on work done in Text Simplification.
In general, IPLO is designed to compare two groups (i.e., documents with binary labels). In our case of multi-labelled psychedelic trip reports, we can take a one-vs-all approach, i.e., compare documents with the label substance.mushrooms against documents without the label substance.mushrooms. This will discover words / tokens that are particularly informative of the substance.mushrooms class. Note again that reports with a particular substance.* label are not necessarily uniquely substance.*, and may contain other labels, since people have a tendency to consume multiple different types of drugs simultaneously.
For those interested, one can use the code from john-hewitt and Chris Callison-Burch's research group, which can be obtained by emailing Chris Callison-Burch directly.
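Although we do not reproduce that code here, the core computation can be sketched in a few lines. The following is a minimal, illustrative implementation of a z-scored log odds ratio with an informative Dirichlet prior; the function name, the toy word counts, and the choice of the pooled counts as prior are our own assumptions for the sketch, not values from the corpus or the original codebase:

```python
import math
from collections import Counter

def log_odds_informative_prior(counts_a, counts_b, prior_counts, prior_scale=1.0):
    """Z-scored log odds ratio of each word in group A vs group B,
    with an informative Dirichlet prior (e.g. background-corpus counts)."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior_scale * sum(prior_counts.values())
    z = {}
    for w in set(counts_a) | set(counts_b):
        aw = prior_scale * prior_counts.get(w, 0.01)  # small floor for unseen words
        ya, yb = counts_a.get(w, 0), counts_b.get(w, 0)
        # difference of prior-smoothed log odds between the two groups
        delta = (math.log((ya + aw) / (n_a + a0 - ya - aw))
                 - math.log((yb + aw) / (n_b + a0 - yb - aw)))
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)  # approximate variance of delta
        z[w] = delta / math.sqrt(var)
    return z

# One-vs-all: "mushrooms" reports vs all other reports (toy counts)
mush = Counter({"fractal": 12, "nausea": 8, "time": 30})
rest = Counter({"fractal": 2, "nausea": 20, "time": 90})
prior = mush + rest  # pooled counts stand in for a background corpus
scores = log_odds_informative_prior(mush, rest, prior)
```

Words with large positive z-scores (here "fractal") are disproportionately associated with the one class, and a {word: z-score} map of the positive scores can be fed directly into a word cloud, exactly as with the tf and tfidf weightings above.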
Tag: Future Directions
To make the word clouds more interesting, we can add custom image masks, as inspired by this article, this tutorial, and the wordcloud package documentation. Here, we show a word cloud of reports related to mushrooms in the shape of a mushroom. It is fascinating that the top two words are "time" and "one".
## Plot wordcloud with image mask
# Make sure white parts are 255 instead of 0, as per: https://www.datacamp.com/community/tutorials/wordcloud-python
# TODO: Why is the wordcloud blurry, and why can I not get the mask to work appropriately?
# def transform_format(val):
# if val == 0: return 255
# else: return val
mask_mushroom = np.array(Image.open("images/util_mushroom_mask_color.png"))
# vectorization of transform_format function above; invert the mask to be in the right format
# Transform your mask into a new one that will work with the function:
# transformed_mask_mushroom = np.ndarray((mask_mushroom.shape[0], mask_mushroom.shape[1]), np.int32)
# for i in range(len(mask_mushroom)):
# transformed_mask_mushroom[i] = list(transform_format(mask_mushroom[i]))
mask_mushroom[mask_mushroom != 0] = 1
mask_mushroom[mask_mushroom == 0] = 255
# mask_mushroom
#plot a mushroom wordmap about mushrooms
#increase "resolution": https://stackoverflow.com/questions/28786534/increase-resolution-with-word-cloud-and-remove-empty-border
# note: when a mask is supplied, the wordcloud package takes its dimensions from the mask,
# so width/height are ignored; a low-resolution mask image is the likely cause of blurriness
wordcloud_mushroom_masked = WordCloud(width=1000, height=1000, mask=mask_mushroom, max_font_size=50, max_words=100, background_color="black").generate(substance_reports_map["substance_mushrooms"])
# wordcloud_mushroom_masked = WordCloud(max_font_size=50, max_words=100, mask=mask_mushroom, background_color="white").generate(substance_reports_map["substance_mushrooms"])
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud_mushroom_masked, interpolation="bilinear")
plt.axis("off")
plt.savefig('wordcloud_naive_mushroom_masked')
plt.show()
TfidfVectorizer output
It is interesting to note that, looking at the TfidfVectorizer feature names, we see many numbers. These numbers either stand alone or are followed by units and quantifiers.
These units seem to be used to specify multiple types of quantities, such as:
A closer examination of this numerical data may yield very interesting information about user demographics, set and setting, and dosage. All of this may be extremely valuable, since dosage and set and setting are among the most important factors in determining one's subjective experience when consuming a drug, and demographics may be correlated with the valence of a particular experience.
Tag: Future Directions
Returning to the discussion of document vectorization, we note explicitly that with bag-of-words methods such as term frequencies and tfidf weightings, each document is represented by a vector of size V, most of which is filled with zeros. This vector representation is extremely sparse and high dimensional (V dimensions)! Even dense word and document embeddings are often on the order of a few hundred dimensions, which is still very high dimensional. Indeed, the concept of high dimensionality is not specific to language; other data sources such as images (a 28 by 28 pixel black and white photograph is $28^2 = 784$ dimensional), videos, sound, fMRI data, etc. are also high dimensional. Indeed, a substantial class of data in machine learning is high dimensional.
Why is high dimensional data interesting? Firstly, it is extremely hard to conceptualize, and essentially impossible to visualize the unprocessed data (how do you represent 784 dimensions on a 2D screen?). Notions of distance in high dimensions also become very non-intuitive, and it is difficult to fully appreciate how "vast" a 784-dimensional space truly is. The performance of algorithms also depends on dimensionality, with algorithms such as linear and logistic regression failing (without regularization) when the number of features exceeds the number of observations, and the performance of various algorithms first increasing and then decreasing with added dimensionality (the Hughes phenomenon). Because high dimensionality is so ever-present, the phenomena that arise only when analyzing high dimensional data are often called the "Curse of Dimensionality" (Bellman, 1964).
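This non-intuitiveness of high dimensional distances can be seen in a toy simulation (unrelated to the trip report data): as dimensionality grows, the contrast between a random point's nearest and farthest neighbours collapses, so that nearly all points become almost equidistant.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=200):
    """(max - min) / min over Euclidean distances from one random point
    in [0, 1]^dim to the others; small values mean distances have
    'concentrated' and carry little discriminative information."""
    points = rng.random((n_points, dim))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in [2, 784, 10000]:
    print(dim, round(float(distance_contrast(dim)), 3))
```

In two dimensions the farthest neighbour is typically many times farther away than the nearest; by 10,000 dimensions the spread is only a few percent, which is one reason distance-based methods (nearest neighbours, clustering) degrade on raw high dimensional vectors.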
"Clustering" refers to a broad class of unsupervised techniques that group observations together by some similarity metric. For example, we can cluster customers into different segments based on their spending habits or demographics, we can cluster images of brains based on particular features detected by fMRI, or we can cluster psychedelic trip reports once they are represented as feature vectors. As long as we have a vector representation of our observations and a distance metric such as the standard Euclidean distance (L2 norm), we can use a variety of algorithms (such as agglomerative hierarchical clustering, k-means clustering, etc.) to cluster psychedelic trip reports together.
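To make this concrete, here is a minimal k-means sketch in numpy; the "report vectors" are synthetic toy blobs standing in for real document feature vectors, and in practice one would simply run sklearn's KMeans on the tfidf or LDA features, as below.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    """Minimal k-means: alternate between assigning each point to its
    nearest centroid (Euclidean distance) and recomputing centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centroid: shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# toy "report vectors": two well-separated blobs in 3 dimensions
X = np.vstack([rng.normal(0, 0.5, (50, 3)), rng.normal(5, 0.5, (50, 3))])
labels, centroids = kmeans(X, k=2)
```

The same two-step loop underlies sklearn's implementation; the library version adds smarter initialization (k-means++), empty-cluster handling, and convergence checks.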
Clustering based on LDA features
As a simple example, we can cluster trip reports together based on their LDA features, in particular by using the document-topic matrix. This code has been commented out, but the reader is welcome to adapt and modify to examine the behavior of clustering algorithms.
# Fit LDA model to obtain document-topic matrix
# Adapted from: https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41
# from sklearn.cluster import KMeans
# from sklearn.decomposition import LatentDirichletAllocation
# n_components = 3
# print("Fitting LDA model with {} components using TFIDF vectorization of documents...".format(n_components))
# lda = LatentDirichletAllocation(n_components=n_components, max_iter=10000, random_state=0)
# # document topic matrix
# print("Creating document-topic matrix...")
# dtopm = lda.fit_transform(dtm_tfidf)
# print("Rendering data frame with document-topic features")
# dtopm_features = pd.DataFrame(dtopm, columns=['T1', 'T2', 'T3'])
# dtopm_features.head()
# # due to long run time of LDA model, save to disk
# import pickle
# with open("dtopm_features_tfidf_3components", "wb") as out_file:
# pickle.dump(dtopm_features, out_file)
# # perform k means clustering with LDA features
# print("Now performing K-means clustering with LDA document-topic features...")
# km = KMeans(n_clusters=5, random_state=0)
# km.fit_transform(dtopm_features)
# print("Finished clustering, now rendering data frame")
# cluster_labels = km.labels_
# cluster_labels = pd.DataFrame(cluster_labels, columns=['cluster_label'])
# pd_trips_encoded_normalized_clusters = pd.concat([pd_trips_encoded_normalized, cluster_labels], axis=1)
# print("Writing df with cluster labels to disk...")
# pd_trips_encoded_normalized_clusters.to_csv("pd_trips_encoded_normalized_clusters.csv", index = False)
# %%R
# R_trips_encoded_normalized_clusters <- read.csv("pd_trips_encoded_normalized_clusters.csv")
# glimpse(R_trips_encoded_normalized_clusters)
# R_trips_encoded_normalized_clusters.long <- R_trips_encoded_normalized_clusters %>% gather(substance_key, isPresent, substance_lsd:substance_UNK)
# # examine the substance distribution for each cluster
# R_trips_encoded_normalized_clusters.long %>%
# filter(isPresent == 1) %>%
# group_by(cluster_label, substance_key) %>%
# summarise(count = n()) %>%
# mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
# arrange(desc(count))
# # TODO: Examine the top words associated with each cluster
# #### R template
# # R_trips_encoded_normalized_eda %>% filter(substance_lsd == 1) %>%
# # select(-substance_lsd) %>%
# # gather(substance_key, isPresent, substance_mushrooms:substance_UNK) %>%
# # filter(isPresent == 1) %>%
# # group_by(substance_key) %>%
# # summarise(count = n()) %>%
# # mutate(percent = round(count / sum(count) * 100, digits = 1)) %>%
# # arrange(desc(count))
Tag: Future Directions
PCA Theory
Principal Component Analysis (PCA) is an unsupervised technique whose chief purpose is to reduce the dimensionality of data while sacrificing as little information as possible from the original dataset. When the data is reduced to one, two, or three dimensions, PCA has the added benefit of serving as a data visualization tool. Since our data is essentially a set of vectors which span a vector space, the notion of "direction" in this vector space is well defined. Furthermore, just as data with only x and y values varies in the x or y direction, or really along any angle (evident when plotted on a scatter plot), data in (say) 100,000 dimensions also varies in each of those 100,000 directions, as well as all the directions in between. It turns out that "reduce dimensionality while sacrificing as little information as possible" is equivalent to finding the directions in which the data varies the most. That is, if we were reducing our data to three dimensions, we would need to find the three directions along which the data varies the most. These directions are called principal component loading vectors (James et al., 2013).
Recall that a direction can be expressed as a vector, which has magnitude and direction. Let our data be an n x p matrix (n observations, each with p features). Consider the linear combination of our original features $X_j$:
$$ Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$The vector $\phi_1 = \{ \phi_{11}, \phi_{21}, \cdots , \phi_{p1} \}$ is typically called the first principal component loading vector, a p-dimensional unit vector which precisely defines the direction in our original feature space along which the data varies the most. The elements of $\phi_1$ are often called the loadings of the first principal component.
With this direction specified, we still need a way to represent all n points in our dataset. Indeed, we can project all the points (each of which is a p-dimensional vector) onto $\phi_1$ by taking the dot product of each observation's p-dimensional vector and $\phi_1$. Since the dot product is a scalar, each point in our original dataset has been reduced to a single dimension, for a total of n values, one for each of our original data points. These n values, $z_1 = \{ z_{11}, z_{21}, \cdots, z_{n1} \}$, are called the scores of the first principal component. In this way, our p-dimensional dataset (n x p) has been reduced to a single dimension (n x 1).
Now, PCA is not limited to reducing data down to a single dimension. We can reduce our data to $M$ dimensions by finding the first M principal component loading vectors $\phi_1, \phi_2, \cdots , \phi_M$, each of which is a p-dimensional vector (James et al., 2013). The first principal component loading vector, $\phi_1$, is the direction along which the data varies the most. The second principal component loading vector, $\phi_2$, is the direction orthogonal to $\phi_1$ along which the data varies the most. $\phi_3$ is the direction orthogonal to both $\phi_1$ and $\phi_2$ along which the data varies the most, and so on, with each subsequent loading vector orthogonal to all previous ones and explaining the most remaining variance. (This orthogonality requirement is key to PCA's optimality guarantees, and relates to the Singular Value Decomposition, which is outside the scope of this thesis; interested readers are invited to review James et al., 2013.)
With these $M$ principal component loading vectors in hand, to reduce high dimensional data to $M$ dimensions we project each of our $n$ data points onto each of the $M$ loading vectors, such that each data point is now represented by an $M$-dimensional vector. The projected values onto the $m$-th principal component loading vector are called the scores of the $m$-th principal component (an n x 1 vector).
Precisely, we refer to the linear combination $ Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$ with the highest variance as the first principal component of the set of features $X_1, X_2, \cdots , X_p$. PCA then finds the user-specified $M$ (orthogonal) principal components that best describe the data, which minimizes reconstruction error (i.e., sacrificing as little information as possible) and maximizes total variance explained (finding the loading vectors corresponding to the directions in which the data varies the most).
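The projection described above can be sketched directly with numpy; the toy data below is purely illustrative, and the loading vectors are obtained here from the singular value decomposition of the centered data matrix:

```python
import numpy as np

# Toy dataset: n = 5 observations, p = 3 features (illustrative values).
X = np.array([[2.0, 0.5, 1.0],
              [1.5, 0.3, 0.8],
              [3.1, 0.9, 1.4],
              [2.4, 0.6, 1.1],
              [0.9, 0.2, 0.5]])

# Center each feature (PCA is defined on mean-centered data).
Xc = X - X.mean(axis=0)

# The principal component loading vectors are the right singular vectors
# of the centered data matrix (equivalently, eigenvectors of its covariance).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi_1 = Vt[0]        # first loading vector: p-dimensional, unit norm

# Scores of the first principal component: dot each observation with phi_1.
z_1 = Xc @ phi_1     # n values: the dataset reduced to one dimension (n x 1)

# Reducing to M dimensions projects onto the first M loading vectors.
M = 2
Z = Xc @ Vt[:M].T    # n x M matrix of principal component scores
```

The first column of `Z` is exactly `z_1`, matching the one-dimensional case described above.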
PCA Uses
We mention without demonstration that PCA can be used as another visualization tool to better understand our corpus of psychedelic trip reports. More generally, the idea of projecting a dataset onto a lower dimensional subspace spanned by some number of basis vectors (which could be PCA loading vectors, or any other set of basis vectors) has broad applications. One application is to visualize document vector embeddings by projecting them onto the first and second principal components. We can also visualize word embeddings (trained or fine-tuned using our dataset) using similar procedures. The principal component scores of each observation calculated by PCA (at both the document and word level) can also be used as input feature vectors for other algorithms, such as clustering and classification. Due to the scope of the thesis, we will not dive too deeply into PCA, and rather earmark this technique as a possible future direction.
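As a minimal sketch of this visualization idea (with a made-up mini-corpus standing in for trip reports), one could project TF-IDF document vectors onto the first two principal components:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# Hypothetical mini-corpus standing in for trip reports.
docs = ["felt waves of euphoria and connection",
        "intense visuals and geometric patterns everywhere",
        "deep calm and a sense of unity",
        "patterns breathing walls and bright colors"]

tfidf = TfidfVectorizer().fit_transform(docs)

# sklearn's PCA requires a dense array, which is fine for a small corpus;
# TruncatedSVD is the usual choice for large sparse TF-IDF matrices.
scores = PCA(n_components=2).fit_transform(tfidf.toarray())

# scores is an (n_docs x 2) matrix: each document's first and second
# principal component scores, ready to scatter-plot with matplotlib.
print(scores.shape)
```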
Tag: Future Directions
In this section, we describe possible additional analyses of our corpus by exploring the classification problem, without going into great depth. The reader is invited to skip this section.
We have a corpus of psychedelic trip reports, where each report is labelled by discrete substance labels. As such, we can attempt to predict the substance(s) consumed given a particular trip report. This opens up a whole world of possible techniques and feature engineering considerations that could be the topic of a separate thesis.
Typically, a dataset prepared for classification has one discrete label per observation. There may be only two possible classes (e.g.: sick vs. not sick; a binary classification problem) or multiple possible classes (a multiclass classification problem). It is fairly rare to have multiple discrete labels per observation, as we do here. When each observation can carry multiple labels drawn from more than two possible classes, we have a multilabel multiclass classification task (also called multioutput-multiclass classification or multi-task classification). This is a fascinating problem that is outside the scope of the thesis; interested readers can check out algorithms from sklearn that handle multiclass and multilabel learning, which are generalizations of trees, k-means, and random forests.
We can also pre-process our substance labels to find the most co-consumed substances and make substance pairs their own labels. This would greatly expand the number of classes, but would reduce the multilabel multiclass classification task to a plain multiclass classification task, which is more well studied. These two formulations ask two different questions: one asks whether each individual substance is present, while the other treats a co-consumed substance combination as a unitary experience. Also see the multilabel classification examples from sklearn.
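A minimal sketch of the multilabel formulation using sklearn's OneVsRestClassifier (one binary classifier per substance); the reports and substance tags below are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical reports, each tagged with one or more substances.
reports = ["bright fractals and ego dissolution",
           "warm empathy and dancing all night",
           "fractals with waves of empathy",
           "calm body load and mild visuals"]
labels = [{"lsd"}, {"mdma"}, {"lsd", "mdma"}, {"mushrooms"}]

# Convert label sets into an n x n_classes binary indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

X = TfidfVectorizer().fit_transform(reports)

# One-vs-rest fits an independent binary classifier per substance label.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)   # same n x n_classes indicator shape as Y
```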
Tag: Future Directions
Beyond traditional classification techniques such as those featured in sklearn, a very interesting direction for future research is to fine-tune state of the art language models such as BERT and GPT-2 on a corpus of psychedelic trip reports for a classification task. Hugging Face offers transformer models such as BERT for fine-tuning.
Tag: Future Directions
Speaking with friends and professors in the industry, it became apparent that one of the most time consuming aspects of machine learning, and more specifically natural language processing in practice, is selecting the appropriate algorithms and then the hyperparameters for the models. For example, sklearn has more than 10 different classifiers, each more suitable for various kinds of input data. Once a classification algorithm is selected, the hyperparameters related to both model definition (e.g.: model size, tree depth, etc.) and learning (e.g.: number of iterations, learning rate, etc.) still have to be tuned.
With these considerations, a compelling future direction is to dive more deeply into automatic NLP pipelines in production environments, to better understand the various feature extraction and hyperparameter tuning steps (e.g.: grid search). A fairly recent release (in 2015) is AutoML, a framework leveraging scikit-learn that automates hyperparameter selection. AutoML features "15 classifiers, 14 feature preprocessing methods, 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters" (Feurer et al., 2015). AutoML features warmstarting (using hyperparameters that worked well previously for similar problems), Bayesian optimization (using a random forest to predict optimal hyperparameters), and ensembling of the 50 best classifiers (Feurer et al., 2015). The interested user is invited to visit the AutoML homepage and also check out an example of a feature extraction pipeline from sklearn.
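As a small illustration of the hyperparameter tuning step (grid search, not AutoML itself), sklearn's GridSearchCV can exhaustively search a pipeline's hyperparameters; the corpus and labels below are made up:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical labelled mini-corpus with two substance classes.
texts = ["racing heart and boundless energy",
         "pure energy and racing thoughts",
         "heart pounding with restless energy",
         "racing energy kept me up all night",
         "melting colors and shifting patterns",
         "patterns of melting geometry everywhere",
         "colors breathing and walls melting",
         "soft melting visuals and patterns"]
labels = ["stimulant"] * 4 + ["psychedelic"] * 4

# A pipeline lets the vectorizer and classifier be tuned jointly;
# hyperparameters are addressed as "<step name>__<parameter>".
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)],
              "clf__C": [0.1, 1.0, 10.0]}

# Exhaustively evaluate all 6 settings with 2-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```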
Tag: Future Directions
Another fascinating problem domain is that of text generation. What would it look like for the computer to automatically write trip reports?
A suitable problem formulation is: "given a sequence of p characters (or words), predict the next character (or word)". The exact formulation gives rise to character-level or word-level models, each of which can also be unigram, bigram, trigram, or in general n-gram based. Readers interested in this problem domain are encouraged to research Hidden Markov Models (HMMs), Long Short-Term Memory networks (LSTMs; see here for an excellent blog post by Chris Olah), BERT (from Google), and GPT-2 (from OpenAI).
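A word-level bigram generator can be sketched in a few lines of pure Python; the training text below is invented for illustration:

```python
import random
from collections import defaultdict

# Hypothetical training text standing in for a corpus of trip reports.
corpus = ("the walls began to breathe and the colors began to sing "
          "the music felt alive and the room felt infinite")

# Build a word-level bigram model: map each word to observed followers.
words = corpus.split()
bigrams = defaultdict(list)
for w1, w2 in zip(words, words[1:]):
    bigrams[w1].append(w2)

def generate(seed, n_words, rng=random.Random(0)):
    """Sample a continuation: pick each next word uniformly at random
    from the words observed to follow the current word."""
    out = [seed]
    for _ in range(n_words):
        followers = bigrams.get(out[-1])
        if not followers:   # dead end: no observed continuation
            break
        out.append(rng.choice(followers))
    return " ".join(out)

print(generate("the", 8))
```

A trigram or general n-gram model conditions on the previous n-1 words instead of just one, trading sparser counts for more coherent output.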
For an example of generating Drake lyrics with HMMs, decision trees, and LSTMs, see here.
Tag: Future Directions
One of the main concerns with machine learning is its black box nature. Especially as neural networks get deeper with current state of the art models (e.g.: Google's models with well over 1 billion parameters), it becomes extremely difficult to demystify the inner workings of these complex models, and to understand sources of performance, bias, and shortcomings. If we are to better understand how language conveys our inner experiences, and use trip reports robustly in psychiatry, psychology, and neuropharmacology research (as well as in developing appropriate neurotargets for neurotechnology), demystifying our machine learning algorithms becomes critically important.
LIME stands for "Local Interpretable Model-Agnostic Explanations", a technique developed by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin at the University of Washington for demystifying machine learning algorithms (Ribeiro, Singh & Guestrin, 2016). Taking a trained model and a particular observation as input, the process highlights the features of the observation that most informed the model's "decision making process".
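For text, the core idea can be sketched as a simplified re-implementation (not the lime library itself): randomly mask words of the document, query the model on each perturbed text, and fit a locally weighted linear surrogate whose coefficients score each word's influence on the prediction. The toy model and text here are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_text(text, predict_proba, n_samples=500,
                 rng=np.random.default_rng(0)):
    """LIME-style local explanation: mask words at random, query the model
    on each perturbed text, and fit a weighted linear surrogate whose
    coefficients score each word's contribution to the prediction."""
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))  # 1 = keep word
    masks[0] = 1                                              # original text
    perturbed = [" ".join(w for w, keep in zip(words, m) if keep)
                 for m in masks]
    probs = np.array([predict_proba(t) for t in perturbed])
    # Weight samples by similarity to the original (fraction of words kept).
    weights = masks.mean(axis=1)
    surrogate = Ridge(alpha=1.0).fit(masks, probs, sample_weight=weights)
    # Rank words by the magnitude of their surrogate coefficient.
    return sorted(zip(words, surrogate.coef_), key=lambda p: -abs(p[1]))

# Toy "model": probability driven entirely by the word "euphoria".
toy_model = lambda t: 0.9 if "euphoria" in t else 0.1
top = explain_text("waves of euphoria washed over me", toy_model)
print(top[0])   # the word with the largest absolute coefficient
```

Because the toy model depends only on one word, the surrogate assigns that word a dominant coefficient, which is exactly the kind of signal LIME surfaces for real classifiers.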
Two examples from Ribeiro et al.'s blog post make this clear:

In this example of the tree frog, the top three classes predicted by the trained model (Google's Inception model) for the given input image were tree frog, billiards, and hot air balloon. We see that LIME is able to highlight the portions of the input image that most informed the model's predictions.
LIME can also be used for textual data, as seen in the following example (also from Ribeiro et al.'s blog post) of a random forest classifier predicting whether an email is about Christianity or atheism. The classifier achieved an accuracy of 92.4%, but through LIME we see that it performed so well entirely for the wrong reasons (Ribeiro, Singh & Guestrin, 2016).

Why is this important? As medicine increasingly integrates machine learning processes, from classifying brain tumors (A, R, & S, 2019), diagnosing skin cancer (M, 2019), and detecting glaucoma (Bowd & Goldbaum, 2008), to an FDA-approved system alerting neurologists of acute stroke cases (Office of the Commissioner, 2019), the need to really understand what is going on under the hood can help save lives. In the field of psychiatry, we may use machine learning methods to better understand the dimensions of subjective experience and consciousness, from using language data (very promising preliminary work is happening at Oxford University at the moment; private communications) to brain imaging studies (R. L. Carhart-Harris et al., 2012). Indeed, Stanislav Grof once remarked, "Psychedelics, used responsibly and with proper caution, would be for psychiatry what the microscope is for biology and medicine or the telescope is for astronomy" (Stanislav Grof, Hofmann, & Weil, 2008). Johns Hopkins and Imperial College London are leading the charge in better understanding psychedelic drugs, their therapeutic effects, and their relationships to dimensions of conscious experience. With the potential of psychedelic drugs and machine learning, we may have previously unimaginable opportunities to alleviate suffering and better support people with "abnormal" conditions such as schizophrenia and psychosis. Currently ostracized by society, people with schizophrenia and psychosis may hold the key to better understanding the state space of the mind, and the dimensions of non-ordinary states of consciousness (Stanislav Grof, 2009).
With these (nascent but promising) possibilities, we must be extremely responsible with how we integrate technology, especially machine learning, into our study. Becoming intimately familiar with tools like LIME, and holding a thoughtful and cautiously optimistic approach, might just be the key for us to co-create a future with less suffering.
In reflecting on who has touched my life and led to the completion of this thesis, the list becomes so long that it could fill volumes. I feel incredibly connected to the people who have shown kindness towards me, without whom I would not be here typing these words.
First and foremost, I am deeply grateful for my parents, Susan Zeng and Lihong Zhao, who have been unwavering pillars for me for 23 years and counting, providing unconditional love through thick and thin, sadness and happiness. I can say for certain I would not have survived to this day without their infinite love, kindness, wisdom, understanding, and nourishment, much less completed this thesis. I am grateful for my sister, Megan Zhao, who has provided the appropriate amount of sibling rivalry, inside jokes, and support as a sister and dear friend.
I would also like to express my deep gratitude for Janis Ripoll Garriga, whose kindness, love, compassion, and presence is unparalleled. Multiple sleepless nights have been bearable only with your indefatigable support and motivation, which has kept the lights on metaphorically and literally. Thank you for always being there for me, and for believing in me when I didn't even believe in myself. Thank you for the delightful memories, the present, and the future; I am so grateful that you are in my life, and I love you with all my heart.
Ken Kubota is one of the most special people in my life. I have Ken to thank for instilling in me deep foundational values of radical transparency, openness, and an unwavering mindful presence with people. Working with Ken for more than a year has been transformative for my psyche, outlook on life, and how I strive to be present with people. He is also the first person to make me aware of psychedelics, during one of the darkest times of my life. With his guidance, I became aware of Johns Hopkins' research on psychedelics, which, to put it lightly, changed my life for the better. Thank you, Ken, for all that you are and all that you inspire.
I owe a deep debt of gratitude to Professors Chris Callison-Burch and Lyle Ungar, who are among the best thesis advisors one could ask for. Professor Callison-Burch, thank you for being my first introduction to the exciting world of natural language processing, and for your enthusiasm, kindness, and understanding. I am so grateful for your support both inside and outside of the classroom, putting people before academics while helping me understand the kaleidoscope that is natural language processing. Professor Lyle Ungar, thank you for hours of enlightening lectures and among the most enjoyable and inspiring conversations I have ever had at Penn. Your mind is one to behold, and the way you integrate vast amounts of information from seemingly disparate disciplines is absolutely delightful. Thank you both for your tireless support, midnight email communications, and inspiring ideas to keep me on the right track. Hopefully this thesis is passable!
Professor Linda Zhao is one of the most important mentors in my life, and is the role model who first taught me data science. Before meeting Linda, I had virtually no idea how to program in python, and had no knowledge of R. You have inspired in me a deep appreciation for the beauty of statistics and the promise of data science. Thank you for believing in me so deeply and treating me like one of your own. I will never forget your presence, wisdom, love, and compassion.
Yev Barkalov deserves a very special mention: he was instrumental in the completion of this thesis. While the thesis is independent work, hours of stimulating conversation with Yev laid the groundwork for EveryQualia, which was a key inspiration for this thesis. Yev also wrote the original script for obtaining the trip reports from Erowid, which was a crucial first step to the thesis. Beyond his insane work ethic, Yev has an incredibly active mind and can always be trusted to provide a nuanced view on anything, and he is a friend I would turn to for the rest of my life for laid-back pastimes and intense work sessions alike.
I would also like to thank Professor Sampath Kannan for being a teacher and advisor who has lent a hand when it counts. Professor Chris Murphy, one of the most understanding and forward-thinking professors I have ever met, provided invaluable support in ensuring my thesis was completed, and serves as an inspiration for how important conversations around mental health and diversity are to students, professors, and entire institutions. Professor Brittany Shields and Professor Hilary Gerstein, thank you for your passion and nuanced take on ethics, for expanding my perspective and reminding me that any engineering and science endeavor is inevitably linked to society, and of the importance of contextualizing studies through an ethical and societal lens. Thank you Professor Max Mintz for being a strict and thoughtful advisor — your commitment to education has instilled in me a respect for propriety. Thank you David Yaden for being a sounding board for psychedelics research from the very beginning. Your advice and perspective were instrumental to the formation of the Penn Society for Psychedelic Science (PSPS). And Desirae Caesar, what would the computer science department do without you? You are such an instrumental part of holding the department together, with your expert knowledge and devotion to students. I know I would not be graduating without your indefatigable support; your smile and laughter will be forever committed to my memory as a source of joy.
Daniel Colson, Anders Sanberg, Robin Hanson, David Champion, Luke Sabor, Alex K. Chen, Trize Stephen Pons, Amanda Ngo, and Quintin Freriches — these are among the most brilliant minds that I have come across. Thank you for hours and hours of stimulating conversation, for keeping me inspired and pushing me to explore new horizons. Every conversation with you is delightful — it is almost always unclear how our conversations would start and where they would go, but it is beyond reasonable doubt they will be fun and challenge my thinking in unexpected ways.
Thank you to Rick Doblin and the Multidisciplinary Association for Psychedelic Studies (MAPS) for providing invaluable advice and being an unwavering source of grounded inspiration. Without you, psychedelics would not be re-entering academia with such grounded fervor as a legitimate topic of research. Thank you Erowid, especially Fire Erowid and Earth Erowid, for working tirelessly to curate such a trusted and vast source of information, all provided for free for the benefit of humankind. Thank you Qualia Research Institute for the work you are doing to formalize consciousness research. Your insights and intuitions are so incredibly inspiring, and I haven't seen anyone else do the work you are doing.
Last but certainly not least, my sincerest and deepest gratitude for the Penn Society for Psychedelic Science (PSPS), the Intercollegiate Psychedelics Network (IPN), and everyone who is involved in this growing mycelial field of connectedness. Both PSPS and IPN are such magical entities that seem to hold promise for the next wave of the psychedelic renaissance. Of particular note, Victor Acero, Rahul Sood, Emily Cribas, Margaret Fleming, Kenneth Shinozuka, Aidan Fitzsimmons, Elin Ahlstrand, and Wendi Yan are simply magical. Hours of laughter and conversation in trying to figure out how to grow the mycelial network together are ones I wouldn't trade for anything else. These people, along with so many other student leaders of IPN, are so hardworking, thoughtful, and inspiring; these are people I would turn to for the most difficult questions and the craziest ideas, because I trust so deeply that no matter what they will hold space thoughtfully, work comfortably with uncertainty, and share their brilliant minds with radical transparency.
cc:
University of Pennsylvania (Penn), Harvard University, Princeton University, University of Connecticut (UConn), Texas Tech University (TTU), University College London (UCL), UC Boulder, Ohio State University (OSU), California Institute of Integral Studies (CIIS), Georgia State University, Colorado State University-Fort Collins, UC Colorado Springs, Yale University, Stanford University, Stanford Business School, West Virginia University, University of Greenwich, University of Oxford
🍄The Mycelial Network
as of December 7, 2019
{ roughly in order of sporing }
Intercollegiate Psychedelics Network (IPN)
NOS OMNES MYCELIS
A, M. V., R, D. P., & S, M. S. (2019). Brain Tumor Classification Using Various Machine Learning Algorithms. International Journal of Research in Arts and Science, 5(Special Issue), 258–270. https://doi.org/10.9756/bp2019.1002/25
Atasoy, S., Donnelly, I., & Pearson, J. (2016). Human brain networks function in connectome-specific harmonic waves. Nature Communications, 7(1). https://doi.org/10.1038/ncomms10340
Bellman, R., & Karush, R. (1964). Dynamic programming: a bibliography of theory and application (No. RM-3951-1-PR). RAND Corp, Santa Monica, CA.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77. https://doi.org/10.1145/2133806.2133826
Bowd, C., & Goldbaum, M. H. (2008). Machine Learning Classifiers in Glaucoma. Optometry and Vision Science, 85(6), 396–405. https://doi.org/10.1097/opx.0b013e3181783ab6
Carhart-Harris, R. L., & Friston, K. J. (2019). REBUS and the Anarchic Brain: Toward a Unified Model of the Brain Action of Psychedelics. Pharmacological Reviews, 71(3), 316–344. https://doi.org/10.1124/pr.118.017160
Carhart-Harris, R. L., Erritzoe, D., Williams, T., Stone, J. M., Reed, L. J., Colasanti, A., … Nutt, D. J. (2012). Neural correlates of the psychedelic state as determined by fMRI studies with psilocybin. Proceedings of the National Academy of Sciences, 109(6), 2138–2143. https://doi.org/10.1073/pnas.1119598109
Carhart-Harris, Robin L., Leech, R., Hellyer, P. J., Shanahan, M., Feilding, A., Tagliazucchi, E., … Nutt, D. (2014). The entropic brain: a theory of conscious states informed by neuroimaging research with psychedelic drugs. Frontiers in Human Neuroscience, 8, 20. https://doi.org/10.3389/fnhum.2014.00020
Carhart-Harris, Robin L., Roseman, L., Bolstridge, M., Demetriou, L., Pannekoek, J. N., Wall, M. B., … Nutt, D. J. (2017). Psilocybin for treatment-resistant depression: fMRI-measured brain mechanisms. Scientific Reports, 7(1). https://doi.org/10.1038/s41598-017-13282-7
Doblin, R. (2002). A Clinical Plan for MDMA (Ecstasy) in the Treatment of Posttraumatic Stress Disorder (PTSD): Partnering with the FDA. Journal of Psychoactive Drugs, 34(2), 185–194. https://doi.org/10.1080/02791072.2002.10399952
Edsinger, E., & Dölen, G. (2018). A Conserved Role for Serotonergic Neurotransmission in Mediating Social Behavior in Octopus. Current Biology, 28(19), 3136-3142.e4. https://doi.org/10.1016/j.cub.2018.07.061
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (pp. 2962–2970).
Griffiths, Roland R., Johnson, M. W., Carducci, M. A., Umbricht, A., Richards, W. A., Richards, B. D., … Klinedinst, M. A. (2016). Psilocybin produces substantial and sustained decreases in depression and anxiety in patients with life-threatening cancer: A randomized double-blind trial. Journal of Psychopharmacology, 30(12), 1181–1197. https://doi.org/10.1177/0269881116675513
Griffiths, R. R., Richards, W. A., McCann, U., & Jesse, R. (2006). Psilocybin can occasion mystical-type experiences having substantial and sustained personal meaning and spiritual significance. Psychopharmacology, 187(3), 268–283. https://doi.org/10.1007/s00213-006-0457-5
Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (2005). Integrating topics and syntax. In Advances in Neural Information Processing Systems (pp. 537–544).
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.
Johns Hopkins University. (2019, September 4). Johns Hopkins launches center for psychedelic research. Retrieved November 11, 2019, from The Hub website: https://hub.jhu.edu/2019/09/04/hopkins-launches-psychedelic-center/
Johnson, M. E. (2016, November 16). Principia Qualia – Opentheory.net. Retrieved December 23, 2019, from Opentheory.net website: https://opentheory.net/2016/11/principia-qualia/
Luhn, H. P. (1957). A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1(4), 309–317. https://doi.org/10.1147/rd.14.0309
Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Mithoefer, M. C., Wagner, M. T., Mithoefer, A. T., Jerome, L., & Doblin, R. (2010). The safety and efficacy of ±3,4-methylenedioxymethamphetamine-assisted psychotherapy in subjects with chronic, treatment-resistant posttraumatic stress disorder: the first randomized controlled pilot study. Journal of Psychopharmacology, 25(4), 439–452. https://doi.org/10.1177/0269881110378371
Moreno, F. A., Wiegand, C. B., Taitano, E. K., & Delgado, P. L. (2006). Safety, Tolerability, and Efficacy of Psilocybin in 9 Patients With Obsessive-Compulsive Disorder. The Journal of Clinical Psychiatry, 67(11), 1735–1740. https://doi.org/10.4088/jcp.v67n1110
M, V. M. (2019). Melanoma Skin Cancer Detection using Image Processing and Machine Learning. International Journal of Trend in Scientific Research and Development, 3(4), 780–784. https://doi.org/10.31142/ijtsrd23936
Nichols, D. E. (2016). Psychedelics. Pharmacological Reviews, 68(2), 264–355. https://doi.org/10.1124/pr.115.011478
Office of the Commissioner. (2019). FDA permits marketing of clinical decision support software for alerting providers of a potential stroke in patients. Retrieved December 23, 2019, from U.S. Food and Drug Administration website: https://www.fda.gov/news-events/press-announcements/fda-permits-marketing-clinical-decision-support-software-alerting-providers-potential-stroke
Osório, F. de L., Sanches, R. F., Macedo, L. R., dos Santos, R. G., Maia-de-Oliveira, J. P., Wichert-Ana, L., … Hallak, J. E. (2015). Antidepressant effects of a single dose of ayahuasca in patients with recurrent depression: a preliminary report. Revista Brasileira de Psiquiatria, 37(1), 13–20. https://doi.org/10.1590/1516-4446-2014-1496
Pahnke, W. N. (1963). Drugs and mysticism: An analysis of the relationship between psychedelic drugs and the mystical consciousness: A thesis (Doctoral dissertation, Harvard University).
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). ACM.
Rogan, J. (2018, June 29). The Joe Rogan Experience: The Human Erowid - Hamilton Morris - Podcast Notes. Retrieved December 23, 2019, from Podcast Notes website: https://podcastnotes.org/2018/06/28/morris/
Sewell, R. A., Halpern, J. H., & Pope, H. G. (2006). Response of cluster headache to psilocybin and LSD. Neurology, 66(12), 1920–1922. https://doi.org/10.1212/01.wnl.0000219761.05466.43
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21. https://doi.org/10.1108/eb026526
Stanislav Grof. (2009). LSD: Doorway to the Numinous: The Groundbreaking Psychedelic Research into Realms of the Human Unconscious. Rochester, Vt: Park Street Press.
Stanislav Grof, Hofmann, A., & Weil, A. (2008). LSD Psychotherapy. Ben Lomond, Calif.: Multidisciplinary Association for Psychedelic Studies.
Varoquaux, G., Buitinck, L., Louppe, G., Grisel, O., Pedregosa, F., & Mueller, A. (2015). Scikit-learn. GetMobile: Mobile Computing and Communications, 19(1), 29–33. https://doi.org/10.1145/2786984.2786995
Wallach, H. M. (2006, June). Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning (pp. 977–984). ACM.
Wilbur, W. J., & Sirotkin, K. (1992). The automatic identification of stop words. Journal of Information Science, 18(1), 45–55. https://doi.org/10.1177/016555159201800106
Zellig Sabbettai Harris. (1954). Distributional structure. New York: Verlag Nicht Ermittelbar.
# ### Create simple wordclouds based on term frequency
# # a list of the "classical psychedelics"
# substances_psychedelics_list = ["substance_mushrooms", "substance_lsd", "substance_mescaline", "substance_5_meo_dmt", "substance_dmt"]
# # NOTE: change this line of code to change which substances to plot
# substances_to_plot = substances_psychedelics_list
# num_plots = len(substances_to_plot)
# print("Now plotting wordclouds for {} substances...".format(num_plots))
# # creating subplots: https://stackoverflow.com/questions/25239933/how-to-add-title-to-subplots-in-matplotlib
# # changing figure size: https://stackoverflow.com/questions/332289/how-do-you-change-the-size-of-figures-drawn-with-matplotlib
# fig = plt.figure(figsize=(50, 10))
# fig.suptitle("Word Clouds based on word Frequencies", fontsize=16)
# for index, substance in enumerate(substances_to_plot):
#     # WordCloud also takes a stopwords = [] as argument; omitted.
#     wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(substance_reports_map[substance])
#     ax = plt.subplot(num_plots, 1, index+1)  # note here for subplots index must be [1, num_plots]
#     ax.set_title("{} Wordcloud".format(substance))
#     plt.imshow(wordcloud, interpolation="bilinear")
#     plt.axis("off")
# plt.savefig('wordclouds_naive_psychedelics_tf.png')
# plt.show()
# ### Plotting a single wordcloud
# # wordcloud = WordCloud(stopwords=stopwords_all, max_font_size=50, max_words=100, background_color="white").generate(reports_text_tokenized["substance_mushrooms"])
# # plt.figure()
# # plt.imshow(wordcloud, interpolation="bilinear")
# # plt.axis("off")
# # plt.show()
# # helper functions to extract keywords from documents
# # from http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XeLcAzJKjmE
# def sort_coo(coo_matrix):
#     tuples = zip(coo_matrix.col, coo_matrix.data)
#     return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
# def extract_topn_from_vector(feature_names, sorted_items, topn=10):
#     """get the feature names and tf-idf score of top n items"""
#     # use only topn items from vector
#     sorted_items = sorted_items[:topn]
#     score_vals = []
#     feature_vals = []
#     # word index and corresponding tf-idf score
#     for idx, score in sorted_items:
#         # keep track of feature name and its corresponding score
#         score_vals.append(round(score, 3))
#         feature_vals.append(feature_names[idx])
#     # create a tuple of (feature, score)
#     # results = zip(feature_vals, score_vals)
#     results = {}
#     for idx in range(len(feature_vals)):
#         results[feature_vals[idx]] = score_vals[idx]
#     return results
# # Tutorial on how to extract key words based on tf-idf: http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XeLcAzJKjmE
# # Using the tf-idf frequencies
# feature_names_tfidf = vectorizer_wc_tfidf.get_feature_names()
# sorted_items=sort_coo(X_vectors_tfidf.tocoo())
# #extract only the top n; n here is 10
# keywords=extract_topn_from_vector(feature_names_tfidf,sorted_items,10)
# # now print the results
# print("\n=====Doc=====")
# # print(doc[:1000])
# print("\n===Keywords===")
# for k in keywords:
#     print(k, keywords[k])
# # # get the first vector out (for the first document)
# # first_vector_tfidfvectorizer= X_vectors_tfidf[0]
# # # place tf-idf values in a pandas data frame
# # df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=feature_names_tfidf, columns=["tfidf"])
# # df.sort_values(by=["tfidf"],ascending=False)
Other ways of visualizing trip reports:
K-means and minibatch K-means; see the sklearn example
Principal Components Analysis (PCA) of trip reports
# TODO: Example code from: https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
# Lars Buitinck
# License: BSD 3 clause
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import logging
from optparse import OptionParser
import sys
from time import time
import numpy as np
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--no-idf",
              action="store_false", dest="use_idf", default=True,
              help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--use-hashing",
              action="store_true", default=False,
              help="Use a hashing feature vectorizer")
op.add_option("--n-features", type=int, default=10000,
              help="Maximum number of features (dimensions)"
                   " to extract from text.")
op.add_option("--verbose",
              action="store_true", dest="verbose", default=False,
              help="Print progress reports inside k-means algorithm.")
print(__doc__)
op.print_help()
def is_interactive():
    return not hasattr(sys.modules['__main__'], '__file__')
# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
op.error("this script takes no arguments.")
sys.exit(1)
# #############################################################################
# Load some categories from the training set
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
# Uncomment the following to do the analysis on all the categories
# categories = None
print("Loading 20 newsgroups dataset for categories:")
print(categories)
dataset = fetch_20newsgroups(subset='all', categories=categories,
shuffle=True, random_state=42)
print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()
labels = dataset.target
true_k = np.unique(labels).shape[0]
print("Extracting features from the training dataset "
"using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
if opts.use_idf:
# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
stop_words='english', alternate_sign=False,
norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
else:
vectorizer = HashingVectorizer(n_features=opts.n_features,
stop_words='english',
alternate_sign=False, norm='l2',
binary=False)
else:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
min_df=2, stop_words='english',
use_idf=opts.use_idf)
X = vectorizer.fit_transform(dataset.data)
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()
if opts.n_components:
print("Performing dimensionality reduction using LSA")
t0 = time()
# Vectorizer results are normalized, which makes KMeans behave as
# spherical k-means for better results. Since LSA/SVD results are
# not normalized, we have to redo the normalization.
svd = TruncatedSVD(opts.n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
X = lsa.fit_transform(X)
print("done in %fs" % (time() - t0))
explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(
int(explained_variance * 100)))
print()
# #############################################################################
# Do the actual clustering
if opts.minibatch:
km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
verbose=opts.verbose)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
% metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, km.labels_, sample_size=1000))
print()
if not opts.use_hashing:
print("Top terms per cluster:")
if opts.n_components:
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
else:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end='')
print()
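For the PCA-style visualization of trip reports mentioned above, a minimal sketch is to project the TF-IDF document vectors to two dimensions with TruncatedSVD (which, unlike PCA, works directly on a sparse matrix) and color each point by its k-means cluster. The documents below are illustrative stand-ins for the Erowid corpus, not real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headlessly

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for trip reports
docs = [
    "intense visuals and patterns everywhere",
    "geometric patterns and bright colors",
    "deep calm and a heavy sedated body",
    "felt sedated, calm, and very heavy",
    "visuals of fractal colors and patterns",
    "a calm, quiet, sedated evening",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Project the sparse TF-IDF matrix to 2-D for plotting
X2 = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)

# Cluster in the reduced space and color points by cluster assignment
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X2)

plt.scatter(X2[:, 0], X2[:, 1], c=clusters)
plt.title("Documents in 2-D SVD space, colored by k-means cluster")
plt.savefig("doc_clusters.png")
```

On the real corpus one would swap in the trip-report texts and a larger `n_clusters`, but the projection-then-plot pattern is the same.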
## TODO: Example code from sklearn, see: https://scikit-learn.org/stable/auto_examples/plot_multilabel.html#multilabel-classification
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA


def plot_hyperplane(clf, min_x, max_x, linestyle, label):
    # get the separating hyperplane
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(min_x - 5, max_x + 5)  # make sure the line is long enough
    yy = a * xx - (clf.intercept_[0]) / w[1]
    plt.plot(xx, yy, linestyle, label=label)


def plot_subfigure(X, Y, subplot, title, transform):
    if transform == "pca":
        X = PCA(n_components=2).fit_transform(X)
    elif transform == "cca":
        X = CCA(n_components=2).fit(X, Y).transform(X)
    else:
        raise ValueError

    min_x = np.min(X[:, 0])
    max_x = np.max(X[:, 0])

    min_y = np.min(X[:, 1])
    max_y = np.max(X[:, 1])

    classif = OneVsRestClassifier(SVC(kernel='linear'))
    classif.fit(X, Y)

    plt.subplot(2, 2, subplot)
    plt.title(title)

    zero_class = np.where(Y[:, 0])
    one_class = np.where(Y[:, 1])
    plt.scatter(X[:, 0], X[:, 1], s=40, c='gray', edgecolors=(0, 0, 0))
    plt.scatter(X[zero_class, 0], X[zero_class, 1], s=160, edgecolors='b',
                facecolors='none', linewidths=2, label='Class 1')
    plt.scatter(X[one_class, 0], X[one_class, 1], s=80, edgecolors='orange',
                facecolors='none', linewidths=2, label='Class 2')

    plot_hyperplane(classif.estimators_[0], min_x, max_x, 'k--',
                    'Boundary\nfor class 1')
    plot_hyperplane(classif.estimators_[1], min_x, max_x, 'k-.',
                    'Boundary\nfor class 2')
    plt.xticks(())
    plt.yticks(())

    plt.xlim(min_x - .5 * max_x, max_x + .5 * max_x)
    plt.ylim(min_y - .5 * max_y, max_y + .5 * max_y)
    if subplot == 2:
        plt.xlabel('First principal component')
        plt.ylabel('Second principal component')
        plt.legend(loc="upper left")


plt.figure(figsize=(8, 6))

X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                      allow_unlabeled=True,
                                      random_state=1)

plot_subfigure(X, Y, 1, "With unlabeled samples + CCA", "cca")
plot_subfigure(X, Y, 2, "With unlabeled samples + PCA", "pca")

X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                      allow_unlabeled=False,
                                      random_state=1)

plot_subfigure(X, Y, 3, "Without unlabeled samples + CCA", "cca")
plot_subfigure(X, Y, 4, "Without unlabeled samples + PCA", "pca")

plt.subplots_adjust(.04, .02, .97, .94, .09, .2)
plt.show()
`TODO`
- Baseline: multinomial Naive Bayes and other techniques; see sklearn [classifying text documents](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py)
- Baseline: logistic regression
  - Coefficients returned by logistic regression can be used as weights to scale words in wordclouds
  - TODO: How do we handle the multilabel case?
  - Appropriate regularization such as LASSO / Ridge (Elastic Net) for feature selection
- Baseline: Naive Bayes classifier
- BERT fine-tuning
  - [Huggingface](https://github.com/huggingface/transformers) transformers; see Chris Callison-Burch's recommendations
- Handling `multilabel learning`; see the sklearn multilabel example above
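Two of these TODO items can be sketched together: a multinomial Naive Bayes baseline made multilabel with `OneVsRestClassifier`, since a single trip report may involve several substances at once. The documents and substance tags below are hypothetical stand-ins for the Erowid data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical reports; note one report carries two substance labels.
docs = [
    "bright visuals and strong euphoria",
    "calm body load and mild euphoria",
    "intense visuals and ego dissolution",
    "sedated and calm with a heavy body load",
]
labels = [{"lsd"}, {"mdma"}, {"lsd", "dmt"}, {"mdma"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # binary indicator matrix, one column per substance

# One binary Naive Bayes problem per substance, over shared TF-IDF features
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(MultinomialNB()),
)
clf.fit(docs, Y)

# Predictions come back as an indicator row per document
pred = clf.predict(["strong visuals and euphoria"])
print(dict(zip(mlb.classes_, pred[0])))
```

With real data, the same pipeline accepts the full report texts and the substance sets parsed from report metadata; swapping `MultinomialNB` for `LogisticRegression` gives the coefficient-based word weights discussed above.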
# # Explore the dataset trips.csv
# import csv
# printed = 0
# with open('trips.csv') as f:
#     csv_reader = csv.reader(f)
#     for line in csv_reader:
#         if printed < 2:
#             print(line)
#             printed += 1
#         else:
#             break
Environment setup notes:
- `brew install python`, `brew update python`
- Need to update Xcode command line tools after every major / minor OS upgrade: `xcode-select --install`; see here
- Issues with installing wordcloud
- Anaconda and pip: `brew link <packageName>`; see here
- Adding autocompletion to Jupyter notebook with this tool
- Getting toc2 to work
- TODO: Putting a pin in this for now
The following sections were originally planned for this thesis. They have been omitted from this submission, with plans to publish them at a later date on alextzhao.io and paradiseinstitute.org.